ICT-287510 RELEASE

A High-Level Paradigm for Reliable Large-Scale Server Software
A Specific Targeted Research Project (STReP)

D6.2 (WP6): Scalability Case Studies: Scalable

Sim-Diasca for the Blue Gene

Due date of deliverable: 30th November 2014
Actual submission date: 27th March 2015

Start date of project: 1st October 2011. Duration: 41 months

Lead contractor: The University of Glasgow. Revision: 0.1

Purpose: To study how to apply RELEASE technologies to scale substantial distributed Erlang systems to large scale architectures, i.e. architectures with hundreds of hosts and up to 10^4 cores.

Results: The main results of this deliverable are as follows.

• We have devised and implemented City-example, a full, relevant simulation for the main case study of this deliverable, which is the scalability of a demanding Erlang application, a simulation engine named Sim-Diasca.

• We have investigated the reliable scalability of two distributed Erlang and SD Erlang benchmarks, Orbit and ACO, on up to 256 hosts (6144 cores).

• We have investigated the scalability and performance of the Sim-Diasca City instance using both conventional tools and the new BenchErl and Percept2 RELEASE tools.

• We have performed additional scalability investigations using RELEASE technologies, which included demonstrating the deployment of Sim-Diasca with the WombatOAM load management tool; the port of the Erlang runtime onto the Blue Gene/Q supercomputer; and an outline design of how SD Erlang could be applied to Sim-Diasca.

ICT-287510 (RELEASE) 23rd December 2015 2

Conclusion: We have shown the challenges of significant scalability studies: even simply being able to run non-trivial applications in larger settings is difficult, not to mention that the benchmarking tools and our ability to synthesize the resulting data must scale as well. We established some common architectural changes that distributed Erlang applications may adopt in order to alleviate some typical scalability bottlenecks.

Project funded under the European Community Framework 7 Programme (2011-14)
Dissemination Level

PU Public
PP Restricted to other programme participants (including the Commission Services)
RE Restricted to a group specified by the consortium (including the Commission Services)
CO Confidential, only for members of the consortium (including the Commission Services)


Scalability Case Studies: Scalable Sim-Diasca for the Blue Gene

Contents

1 Executive Summary 3
2 The main case study 5
2.1 Sim-Diasca Overview 5
2.2 City Example 6
2.2.1 Overview of the simulation case 6
2.2.2 Description of the simulated elements 6
2.2.3 Additional changes done for benchmarking 11
3 Benchmarks 13
3.1 Orbit 13
3.1.1 Running Orbit on Athos 14
3.1.2 Distributed Erlang Orbit 15
3.1.3 SD Erlang Orbit 16
3.1.4 Experimental Evaluation 17
3.1.5 Results on Other Architectures 18
3.2 Ant Colony Optimisation (ACO) 24
3.2.1 ACO and SMTWTP 24
3.2.2 Multi-colony approaches 24
3.2.3 Evaluating Scalability 28
3.2.4 Experimental Evaluation 29
3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster 29
3.3.1 Basic results 30
3.3.2 Increasing the number of messages 32
3.3.3 Some problematic results 32
3.3.4 Network Traffic 38
3.4 Summary 38
4 Measurements 40
4.1 Distributed Scalability 40
4.1.1 Performance 40
4.1.2 Distributed Performance Analysis 41
4.1.3 Discussion 44
4.2 BenchErl 44
4.3 Percept2 45
5 Experiments 50
5.1 Deploying Sim-Diasca with WombatOAM 50
5.1.1 The design of the implemented solution 50
5.1.2 Deployment steps 51
5.2 SD Erlang Integration 53
6 Implications and Future Work 55
A Porting Erlang/OTP to the Blue Gene/Q 56
A.1 Basing Erlang/OTP's Distribution Mechanism on MPI 56
A.2 MPI Driver Internals 57
A.3 Current Status of the Blue Gene/Q Port 58
B Single-machine ACO performance on various architectures and Erlang/OTP releases 58
B.1 Experimental parameters 59
B.2 Discussion of results 61
B.2.1 EDF Xeon machines 61
B.2.2 Glasgow Xeon machines 61
B.2.3 AMD machines 61
B.3 Discussion 63


1 Executive Summary

The stated objectives of this deliverable are to "port Sim-Diasca to the Blue Gene architecture, adding locality control as required". In the event we have interpreted this more generally, studying two benchmarks in addition to Sim-Diasca, and measuring them on five parallel architectures (Section 3.1.5), as outlined below.

The deliverable aims to study the scalability of Erlang programs in order to be able to process larger problem sizes while making good use of available computing resources. More precisely, the overall aim here is to study:

• How Erlang programs currently scale, using large computing infrastructures such as high-performance clusters. We intended to investigate the performance on the Blue Gene/Q supercomputer but, as outlined in Appendix A, the corresponding port of the Erlang runtime was only partly functional, due to issues at the level of the networking back-end. Instead we have used 5 conventional clusters.

• The extent to which the scalability of Erlang programs can be improved by adopting architectural changes and making various software choices. We compare the performance of the Erlang/OTP release that existed at the start of the project (R15B) with the version containing the RELEASE scalability improvements (17.4). We also measure the impact of using the SD Erlang version developed in the project.

To achieve these goals, the scalability of one main case study will be studied, namely the discrete-time simulation engine named Sim-Diasca (released as free software by EDF since 2010), whose purpose is to execute large simulations of complex systems (Section 2.1). For RELEASE, a full benchmarking simulation case named City-example has been devised and implemented by EDF (Section 2.2). It transpires that the Sim-Diasca City instance scales up to at least 16 hosts (256 cores) on two clusters, but exhibits poor efficiency on both (Section 4.1). We investigate the scalability issues using both standard profiling tools (Section 4.1) and the new RELEASE tools (Sections 4.2 and 4.3).

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 hosts. These are not the network connectivity issues that emerge at scales of around 60 hosts in our Riak study [GCTM13] and are established in Erlang folklore. To investigate the issues at these larger scales we re-use the Orbit benchmark (D3.1) and develop a new Ant Colony Optimisation (ACO) benchmark. We find that SD Erlang improves the performance of both ACO and Orbit beyond 60 hosts (Sections 3.1.4 and 3.3.4). Moreover, applications running on Erlang/OTP R15B, 17.4 (Official) and 17.4 (RELEASE) all exhibit similar scaling, e.g. similar speedup and runtime curves. However, the 17.4 versions have slightly smaller runtimes than R15B on AMD architectures (Appendix B), while the converse holds on Intel Xeon machines (Section 3.3).

We demonstrate the deployment and monitoring of Sim-Diasca using our new WombatOAM tool (Section 5). Although the Sim-Diasca City instance has not reached the 60-host scale where SD Erlang techniques can help, we present a preliminary design for applying SD Erlang to it (Section 5.2).

Partner Contributions: EDF created the City-example simulation case based on Sim-Diasca (Section 2), provided access to the Blue Gene/Q and the Athos cluster, and as lead participant coordinated the case study. Glasgow designed and measured the ACO and Orbit benchmarks (Section 3), investigated the scalable performance of the City Sim-Diasca instance using conventional tools (Section 4.1), and proposed a design for incorporating SD Erlang into the design of Sim-Diasca (Section 5.2). Kent provided the Percept2 tool, applied to Sim-Diasca by ICCS in Section 4.3. ICCS and Uppsala also applied BenchErl to Sim-Diasca (Section 4.2). Uppsala worked on the port of Erlang to the Blue Gene/Q, which is outlined in Appendix A. ESL developed a version of Sim-Diasca whose deployment relied on WombatOAM, and demonstrated its deployment and monitoring with WombatOAM (Section 5). EAB provided the Erlang/OTP releases on which we base our performance measurements.


2 The main case study

2.1 Sim-Diasca Overview

Sim-Diasca stands for Simulation of Discrete Systems of All Scales. Sim-Diasca (http://www.sim-diasca.com) is a discrete-time simulation engine designed to be applied to large-scale complex systems. This engine is developed by EDF R&D, and it has been released since 2010 as free software under the GNU LGPL licence.

Simulators tend to be sizable, if not massive: typical examples are large-scale information systems, smart metering infrastructures involving millions of interacting devices, full ecosystems, the operating components of utilities (energy, waste, etc.) at the scale of entire cities, etc. As long as a target system can be logically subdivided into (potentially very numerous) parts interacting over discrete time, chances are that it can be modelled according to Sim-Diasca's conventions and then simulated by this engine.

The overall objective of the engine is to evaluate correctly the models involved in a simulation, and for that to preserve key properties, like causality, a total reproducibility and some kind of "ergodicity" (a fair exploration of the possible outcomes of the simulation[1]).

Preserving these properties would not be a real problem if the size of the simulated systems remained within reasonable bounds. As that is by design hardly the case for most complex systems (extrapolating their behaviour based on scale models is hazardous at best), the engine had to be designed so that it can deal with up to millions of tightly interacting model instances. Such simulations cannot be evaluated unless major efforts are spent so that they are as much as possible parallel (they can make use of all the cores of all the processors of a computer) and distributed (a set of networked computers can be used in order to collectively run that single simulation). This is often needed to keep the simulation durations (in wall-clock time) below a threshold (that could not be met if using at most one core of one processor, like many engines do), and to get access to enough memory (RAM) so that these simulations can exist at all.

Once these concurrent (i.e. parallel and distributed) operations can be properly expressed and organised, they still have to be implemented and be effectively run on actual, adequate processing resources: typically HPC[2] clusters or supercomputers such as EDF's Blue Gene/Q.

So the central difficulty is to preserve the aforementioned properties despite a massive concurrency and very significant problem sizes: scalability is surely at the heart of the Sim-Diasca use case. This is all the more a challenge as these discrete-time simulation engines are far from being embarrassingly parallel problems: one should not expect to see perfect speed-ups here, as many interleaved operations have to be finely synchronised by the engine so that all constraints are met; opening up any underlying potential concurrency thus comes at a cost.

More precisely, based on the requested simulation frequency, Sim-Diasca splits the simulated time into a series of time steps, automatically skipping the ones that can be jumped over, and reordering the inter-model messages so that properties like reproducibility are met. Causality resolution requires that time steps be further divided into as many logical moments (named diascas) as needed. During a given diasca, all model instances that have to be scheduled will then be evaluated fully concurrently, but this massive parallelism can only happen between two (lightweight) distributed synchronisations[3].

[1] Even in the absence of stochastic models, concurrent events allow for multiple possible "licit" trajectories of the target system.

[2] Meaning High Performance Computing.

This demand of scalability, combined with the need to rely on HPC resources to evaluate such larger simulations, makes the title of this deliverable, Scalable Sim-Diasca for the Blue Gene, quite self-explanatory.
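The two-level time structure described above can be illustrated with a small sketch. The Python below is our own illustration, not Sim-Diasca's implementation (which is in Erlang and distributed): virtual time advances in ticks, idle ticks are skipped, and each tick is subdivided into as many diascas as message causality requires, with a synchronisation conceptually sitting between consecutive diascas.

```python
def run(actors, horizon):
    """actors: dict name -> actor; an actor exposes .next_tick (virtual time
    of its next spontaneous action, or None) and .act(kind, payload), which
    returns a list of (target, message) pairs sent during that diasca.
    Returns the total number of diascas evaluated."""
    diascas = 0
    while True:
        pending = [a.next_tick for a in actors.values() if a.next_tick is not None]
        if not pending or min(pending) > horizon:
            break
        tick = min(pending)                      # jump over idle ticks
        # Diasca 0: spontaneous behaviour of all actors scheduled at this tick
        # (conceptually evaluated in parallel, between two synchronisations).
        outbox = []
        for a in actors.values():
            if a.next_tick == tick:
                a.next_tick = None
                outbox += a.act('spontaneous', None)
        diascas += 1
        # Further diascas: triggered behaviour, until no message is exchanged.
        while outbox:
            outbox = [m for target, msg in outbox
                      for m in actors[target].act('triggered', msg)]
            diascas += 1
    return diascas

class Ping:                                      # sends one message at tick 0
    def __init__(self): self.next_tick = 0
    def act(self, kind, payload): return [('pong', 'hello')]

class Pong:                                      # only reacts when triggered
    def __init__(self): self.next_tick, self.got = None, []
    def act(self, kind, payload):
        self.got.append(payload)
        return []
```

Running `run({'ping': Ping(), 'pong': Pong()}, horizon=10)` evaluates one spontaneous diasca and one triggered diasca before the simulation reaches quiescence.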

2.2 City Example

2.2.1 Overview of the simulation case

The City Example simulation case has been designed to provide an open, sharable, tractable yet representative use case of Sim-Diasca for RELEASE's benchmarking purposes. Sim-Diasca is indeed a simulation engine, not a simulator; hence we need to define a simulation on top of it to create a benchmark.

The City example has been designed so that it is potentially arbitrarily scalable, both in terms of duration and size: there is no bound to the duration in virtual time during which the target city can be evaluated (of course the wall-clock time will in turn reflect this), nor to its size, as this is a telescopic simulation case based on a target system (the city) that is, according to various consistency constraints, generated procedurally.

Hence the City example can be used to benchmark arbitrarily long and large simulations, reflecting the typical issues that many real-world simulations exhibit. Some examples include: sequential phases becoming acute problems, new bottlenecks appearing as the scale increases, each resource showing a criticality profile, etc.

2.2.2 Description of the simulated elements

This specific simulation attempts to represent a few traits of a city, i.e. the one that deals with waste management and the one that corresponds to the weather system above it.

The waste management system. Before being simulated, an artificial city must be procedurally generated. For that, a number of waste sources (residential or industrial), incinerators and landfills are defined, and a road network (made of roads and road junctions) is generated to interconnect them.

A pool of waste trucks is then created and dispatched on the road network: they will each strive, at simulation-time, to transport wastes (multiple kinds of them are defined) so that the garbage produced by the various waste sources is collected and then transformed in incinerators, resulting in bottom ash that is then to be transported farther, into landfills.

In a properly balanced system, none of the waste storage facilities will be saturated in the process, incinerators will be appropriately fed, and waste will not accumulate in the chain.

An example of a road network corresponding to such a city is represented in Figure 1.

The waste system of these cities thus includes the following elements:

- waste sources, which are either residential (they are numerous, each producing small quantities of various waste types) or industrial (there are a few of them, mostly producing large quantities of mostly other waste types);

- incinerators, each being able to burn some of these types of waste (the duration of this process depending on several factors, including which tank is used, the kind of waste, and the burners that are available for that), but producing in turn non-incinerable waste (bottom ashes);

- landfills, which are able to store all kinds of waste (incinerable or not) but are not able to transform them;

[3] These synchronisations just operate so that a consensus on the next overall virtual timestamp is established.


Figure 1: A tiny instance of generated road network

- waste trucks, which are able to transfer wastes from a point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), limited storage and possibilities of mixing wastes, and limited knowledge of their surroundings[4];

- a road network, which allows vehicles (currently only waste trucks) to reach points of interest; this is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of vehicles on them, as shown in Figure 2); this network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS, currently not used in the course of the simulation as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3.

This waste management system is not so trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep busy only a limited number of cores simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported[5], and this showed indeed a level that,

[4] These disaggregated, individual-based simulations rely only upon decentralised, partial information: for example, no agent - except, before the simulation starts, the mini-GIS - has a total knowledge of the road network (which, during the simulation, does not exist as such, for scalability reasons - it is merely an implicit graph).

[5] Reporting the diasca count has had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations, we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead to having the list of their outgoing roads be permuted in some cases, which could lead in turn, far later in the simulations, to waste trucks making different


Figure 2: Vehicle speed based on the load of a road

Figure 3: Main classes and models of interest for the waste management system


Figure 4: Phases of a few weather cells, recreating Lorenz's strange attractor

once converted into a lower actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem was lying in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require so much processing, while the model instances maintain fairly complex states and communicate a lot - and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.
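The arithmetic behind the concurrency meter mentioned above is simple, and can be sketched as follows. The function below is a hypothetical illustration of ours (not the engine's API): given, for each diasca, the number of model instances scheduled, it reports the diasca count, the average theoretical concurrency, and the lower concurrency actually exploitable once capped by the available cores.

```python
def concurrency_report(scheduled_per_diasca, core_count):
    """scheduled_per_diasca: for each diasca, how many model instances were
    scheduled at it. Returns (diasca count, average theoretical concurrency,
    average concurrency actually usable on core_count cores)."""
    diascas = len(scheduled_per_diasca)
    theoretical = sum(scheduled_per_diasca) / diascas
    # Actual concurrency is bounded by the cores really available:
    usable = sum(min(n, core_count) for n in scheduled_per_diasca) / diascas
    return diascas, theoretical, usable
```

For example, a case scheduling 1000 instances at one diasca but only a handful at the following ones shows a high theoretical average yet a modest usable level, which is exactly the symptom the waste-only case exhibited.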

The weather system. To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically thanks to a Runge-Kutta fourth-order method. It is additionally unsettled by its neighbours, as adjacent cells influence each other.

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resource demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.
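The numerical core each weather cell evaluates can be sketched in a few lines. The sketch below assumes the classical Lorenz parameter values and, for brevity, omits the coupling with adjacent cells; it is an illustration, not the engine's code.

```python
# One weather cell's computation: the Lorenz system integrated with a
# fourth-order Runge-Kutta step. The influence of neighbouring cells is
# deliberately left out of this sketch.
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def lorenz(state):
    x, y, z = state
    return (SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z)

def rk4_step(f, state, dt):
    shift = lambda s, k, h: tuple(si + h * ki for si, ki in zip(s, k))
    k1 = f(state)
    k2 = f(shift(state, k1, dt / 2))
    k3 = f(shift(state, k2, dt / 2))
    k4 = f(shift(state, k3, dt))
    return tuple(si + dt / 6 * (a + 2 * b + 2 * c + d)
                 for si, a, b, c, d in zip(state, k1, k2, k3, k4))

def trajectory(state, dt, steps):
    """Iterate the cell's state; cells starting from different initial
    conditions drift apart along the strange attractor of Figure 4."""
    for _ in range(steps):
        state = rk4_step(lorenz, state, dt)
    return state
```

Because every cell repeats this fixed amount of floating-point work at each scheduling, the induced load is homogeneous in simulated time and space, which is precisely what made the weather grid a good CPU-load dial.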

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5: Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references onto the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control a processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case: one overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and that can moreover be finely and easily tuned in terms of respective resource consumption.

Should we have to try to figure out the actual resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will start by being most probably CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then be increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second being processing), but will progressively lead to scattering the interacting instances more and more across the hosts[6] - thus increasingly replacing local communications by networked ones, and slowing down the

[6] Even with a smart load balancer, the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regards to the number of hosts, and a single sweet spot to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration was increasing very quickly as the scale was growing - notably because of the embedded mini-GIS[7], which was operating sequentially and whose load was growing exponentially with the number of spatialised instances to manage[8].

Efforts were made in order to remove that GIS bottleneck and make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be read directly from a pre-established file. If indeed the pre-simulation phases were shortened, the creation of the initial instances itself remained a demanding operation, even if it was largely made parallel.
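Why such a format can tolerate cyclic references is worth a brief illustration. The miniature format and loader below are hypothetical, far simpler than Sim-Diasca's actual ones: all instances are created in a first pass while references stay symbolic, and a second pass resolves them, so that a road and a junction may freely refer to each other.

```python
import re

def load(lines):
    """Each line declares 'Name | Class | key=value, ...'; a value of the
    form ref:OtherName is a symbolic reference to another instance."""
    instances, pending = {}, []
    # Pass 1: create every instance; keep references symbolic for now.
    for line in lines:
        name, klass, raw = [s.strip() for s in line.split('|')]
        obj = {'class': klass}
        for pair in filter(None, raw.split(',')):
            key, value = [s.strip() for s in pair.split('=')]
            m = re.fullmatch(r'ref:(\w+)', value)
            if m:
                pending.append((name, key, m.group(1)))
                value = None
            obj[key] = value
        instances[name] = obj
    # Pass 2: resolve references - cycles included, since by now
    # every instance exists.
    for name, key, target in pending:
        instances[name][key] = instances[target]
    return instances
```

Because the first pass has no ordering constraints between lines, it is also the pass that lends itself to the largely parallel processing mentioned above.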

The last changes that were made dealt with the integration of third-party tools with Sim-Diasca, like BenchErl and Percept2.

Distributed applications like Sim-Diasca of course have their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment on its own instead; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

If an ad hoc solution for the BenchErl integration could finally be devised, not only did the deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which were the elected nodes, and notify it when each simulation phase began or finished (e.g. monitoring the

[7] GIS stands for Geographic Information System.

[8] The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances, otherwise the shorter roads would lead to traffic durations that would be brief to the point of inducing, when being quantised over the simulation time-step, a relative error above the default threshold allowed by the engine; Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading might not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) to the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.
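The shape of such a plugin system can be sketched as follows. This is a hypothetical miniature of ours, not the engine's actual API: the engine notifies each registered plugin when a simulation phase starts, and collects in return any settings updates the plugin requests.

```python
class Engine:
    """Stand-in for the simulation engine's plugin-facing side."""
    def __init__(self):
        self.plugins = []
        self.settings = {}

    def register(self, plugin):
        self.plugins.append(plugin)

    def enter_phase(self, phase):
        # Each plugin may return a dict of requested setting updates
        # (e.g. the number of schedulers for the computing nodes).
        for plugin in self.plugins:
            requested = plugin.on_phase_start(phase) or {}
            self.settings.update(requested)

class Tracer:
    """A profiling-tool stand-in that only traces the simulation phases."""
    def __init__(self):
        self.seen = []

    def on_phase_start(self, phase):
        self.seen.append(phase)
        if phase == 'simulation':          # only profile the simulation itself,
            return {'schedulers': 12}      # not the initial loading
```

The two-way exchange described above falls out naturally: phase notifications flow from engine to tool, and requested settings flow back.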

On that technical basis, measurements were performed. Results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables the programmer to control locality and reduce connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model nodes are grouped into a number of s_groups: nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global namespace; instead, every s_group has its own namespace, which is shared among the group members only.
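The connection count saved by this grouping can be illustrated with a toy model (ours, not SD Erlang's API): nodes sharing an s_group are all mutually connected, while any other link must be set up explicitly and is not propagated.

```python
from itertools import combinations

def connections(s_groups, explicit=()):
    """s_groups: dict s_group name -> set of node names. Nodes sharing an
    s_group are pairwise connected (transitivity within the group);
    'explicit' lists extra point-to-point, non-transitive links.
    Returns the set of undirected connections as frozenset pairs."""
    links = set()
    for members in s_groups.values():
        links |= {frozenset(pair) for pair in combinations(sorted(members), 2)}
    links |= {frozenset(pair) for pair in explicit}
    return links
```

For instance, two 3-node s_groups overlapping in one gateway node yield 6 connections among 5 nodes, where fully connected distributed Erlang would maintain 10; the gap widens quadratically with the node count.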

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project, and is available from http://www.erlang.org/download_release13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found at https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001-atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node, and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note for example how the jobs labelled S and V are fragmented.
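SLURM compresses such fragmented allocations into bracketed range expressions. The helper below is a sketch of our own (not part of SLURM, whose real hostlist syntax is richer, allowing nesting for instance) that expands a simple expression into individual hostnames, e.g. for feeding a node list to a deployment script.

```python
import re

def expand_nodelist(expr):
    """Expand e.g. 'atcn[127-129,163]' into ['atcn127', 'atcn128', ...]."""
    m = re.fullmatch(r'(\w+)\[([\d,-]+)\]', expr)
    if not m:
        return [expr]                       # a single, unbracketed host
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(','):
        lo, _, hi = part.partition('-')
        width = len(lo)                     # preserve zero padding (atcn001)
        for n in range(int(lo), int(hi or lo) + 1):
            hosts.append(f'{prefix}{n:0{width}d}')
    return hosts
```

For example, `expand_nodelist('atcn[127-129,163]')` yields the four hosts atcn127, atcn128, atcn129 and atcn163.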

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that it appears that the node names do not correspond exactly to the physical structure of the cluster: see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers (x1, ..., xn) ∈ [0, X]. The generator functions are applied

ICT-287510 (RELEASE) 23rd December 2015 14

Figure 6 SLURM allocation

on the new numbers until no new number is generated.

The following features make Orbit a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14], which use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here -N is the number of Athos hosts, -c is the number of cores per node, -t is the requested time in minutes, and --qos=release is the RELEASE project quota that allows us to request up to 256 Athos hosts.

To run the experiments we need to define parameters in the run-slurm script (Figure 7).


FROMNUMNODES is the minimum number of nodes, on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7 Parameters in run-slurm

Figure 8 Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to FROMNUMNODES=4, STEPNODES=3 and NUMREPEAT=2, then the experiment will run on 4, 7 and 10 nodes, and every experiment will run twice.
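The resulting schedule of node counts can be sketched as follows (illustrative Python; the actual run-slurm script is a shell script, and the names merely mirror its variables):

```python
# Illustrative sketch of how run-slurm's parameters determine the node
# counts used for successive runs (the real script is a shell script).

def node_counts(from_num_nodes, step_nodes, max_nodes):
    """Node counts for successive runs, up to the size of the allocation."""
    return list(range(from_num_nodes, max_nodes + 1, step_nodes))

# With a 10-node allocation, FROMNUMNODES=4 and STEPNODES=3:
print(node_counts(4, 3, 10))  # -> [4, 7, 10]
```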

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; we then run the experiment using the timetest script, and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer than reusing the same VMs for all runs.

The module, function and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench:dist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find in which part of the hash table the number should be stored.
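The partitioning step can be sketched as follows (illustrative Python, not the RELEASE code; the function name is ours):

```python
# Sketch of DHT partitioning in D-Orbit (illustrative, not the Erlang code).
# Each worker process owns one slice of the hash table; a hash of a newly
# generated number determines which worker must store (and deduplicate) it.

def owner(x, num_workers):
    """Index of the worker process whose table slice should store x."""
    return hash(x) % num_workers

# A vertex generated on any node is sent to the worker owner(x, W), so each
# number is checked for novelty in exactly one place.
```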

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
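The credit idea can be sketched as follows (a simplified, single-threaded Python model of the [MC98] scheme, not the Orbit implementation; all names are ours, and for simplicity finished tasks return their credit directly to the master):

```python
# Sketch of credit-based termination detection. The master starts with
# credit 1; spawning a task splits the spawner's credit; a finishing task
# returns its credit to the master. The computation has terminated exactly
# when the master has recovered all of the credit.

from fractions import Fraction

class Master:
    def __init__(self):
        self.recovered = Fraction(0)

    def spawn_root(self):
        return Fraction(1)          # total credit in the system

    def on_task_done(self, credit):
        self.recovered += credit    # credit flows back on completion

    def terminated(self):
        return self.recovered == 1

def split(credit):
    """Split a task's credit in half to hand to a newly spawned task."""
    half = credit / 2
    return half, half

master = Master()
c0 = master.spawn_root()
c1, c2 = split(c0)          # the root task spawns one child
master.on_task_done(c1)     # root finishes: credit not yet complete
assert not master.terminated()
master.on_task_done(c2)     # child finishes: all credit recovered
assert master.terminated()
```

Using exact fractions avoids the floating-point rounding that would otherwise make the recovered credit fail to sum back to exactly 1.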

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/


Figure 9 D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code.

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is benchg123451.

• We run experiments for the following initial Orbit space sizes: 2·10^6, 3·10^6, 4·10^6 and 5·10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, and repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group the nodes into s_groups. Here we have two types of s_group: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has exactly one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is thus determined by the number of worker nodes in its worker s_group.


Figure 10 Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.
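These connection counts can be illustrated numerically (an illustrative Python sketch of the formulas above, with integer division standing in for the exact fraction; the example sizes are ours):

```python
# Per-node TCP connection counts: N nodes in total, each worker s_group
# containing M nodes (sketch based on the formulas in the text).

def d_orbit_worker_conns(n):
    return n - 1                    # distributed Erlang: fully connected

def sd_orbit_worker_conns(m):
    return m - 1                    # SD Erlang: only its own s_group

def sd_orbit_submaster_conns(n, m):
    # own s_group plus the master s_group (roughly (N - 1) / M sub-masters)
    return (m - 1) + (n - 1) // m

# Example: 1 master node plus 10 s_groups of 11 nodes each (N=111, M=11).
# A worker then needs 10 connections instead of 110; a sub-master needs 20.
```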

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of the worker processes in its worker s_group, for collecting credit and data, and for forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we define the following additional parameters:

• Sub-master nodes are placed on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on the sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that in this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No  Name      Location  Hosts  Cores/host  Total cores  Max cores  Wait time  Processor                  RAM    Distributed Erlang port
1   GPG       GLA       20     16          320          320        0          Xeon E5-2640 v2, 2GHz      -      Yes
2   TinTin    Uppsala   160    16          2560         -          -          -                          -      Yes
3   Kalkyl    Uppsala   -      8           -            -          varies     -                          -      Yes
4   Athos     EDF       776    24          18624        6144       varies     Xeon E5-2697 v2, 2.7GHz    64GB   Yes
5   Zumbrota  EDF       4096   16          65536        -          17hrs      Blue Gene/Q (PowerPC A2)   -      No

Table 1 Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts to degrade; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes continues to grow.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human intervention is then required to restart them. The way SLURM works, a user is not informed of the reason for such failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put approximately 157 Athos hosts out of action, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11 D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12 D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13 SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14 D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15 D-Orbit and SD-Orbit Performance on Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while the other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
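One generation of this scheme can be sketched as follows (illustrative Python, not the Erlang SMP-ACO code; the evaporation and reinforcement constants are made-up placeholders, and the solution construction and cost functions are left to the caller):

```python
# Sketch of one ACO generation (illustrative; constants 0.9 / 1.0 are
# placeholders, and construct/cost are supplied by the caller).

def run_generation(num_ants, pheromone, construct, cost):
    """Each ant builds a schedule guided by the pheromone matrix; the best
    (lowest-cost) schedule reinforces its (job, position) entries, while
    every entry decays."""
    solutions = [construct(pheromone) for _ in range(num_ants)]
    best = min(solutions, key=cost)
    n = len(pheromone)
    for i in range(n):                      # evaporation: decay all entries
        for j in range(n):
            pheromone[i][j] *= 0.9
    for pos, job in enumerate(best):        # reinforce the best solution
        pheromone[job][pos] += 1.0
    return best
```

This mirrors the division of labour described above: the ants only read the matrix while constructing solutions, and a single coordinator performs the update between generations.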

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of the network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16 Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and each colony process, and IA bidirectional communications between a colony process and each of its ant processes.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes we need to find a relation between the number of processes, nodes and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying the following:


Figure 17 Node Placement in Multi-Level Distributed ACO


Figure 18 Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 of the 150 nodes can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
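The ML-ACO tree sizing described above can be sketched as a small computation (illustrative Python; the function names are ours):

```python
# Nodes used by an X-level sub-master tree with P processes per sub-master
# node, following the inequality 1 + P + ... + P^(X-2) + P^X <= N: the last
# level holds P^X colony nodes, while the levels above it form the tree of
# sub-masters.

def tree_nodes(p, x):
    """Total nodes used by an x-level tree (sub-masters plus colonies)."""
    return sum(p ** k for k in range(x - 1)) + p ** x

def max_levels(p, n):
    """Largest number of levels X such that the tree fits on n nodes."""
    x = 1
    while tree_nodes(p, x + 1) <= n:
        x += 1
    return x

# P = 5, N = 150: a 3-level tree using 1 + 5 + 125 = 131 of the 150 nodes.
```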

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (master, sub-masters, colonies and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed it with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


Figure 19 Mean Error (mean error (%) vs number of colonies)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken per solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version on 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20 Execution time (mean execution time (s) vs number of colonies)

removed non-determinacy by replacing the random number generator with a function which returns a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but it is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.

We ran each experiment with Erlang versions R15B, OTP 17.4 and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times reported here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and that the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than that of all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21 R15B execution times, Athos cluster

Figure 22 OTP 17.4 execution times, Athos cluster


Figure 23 OTP 17.4 (RELEASE version) execution times, Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than the R15B VM does.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, whereas the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate phenomena similar to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24 TL-ACO execution times, Athos cluster

Figure 25 ML-ACO execution times, Athos cluster


Figure 26 GR-ACO execution times, Athos cluster

Figure 27 R15B execution times, messages × 500


Figure 28 OTP 17.4 execution times, messages × 500

Figure 29 OTP 17.4 (RELEASE version) execution times, messages × 500


Figure 30 R15B execution times (2), Athos cluster

which illustrate a phenomenon that has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at thetime of the experiments illustrated here but was much more lightly loaded when the experiments insect331 were performed (on a Saturday eveningSunday morning) We suspect that the Athos clusterhas a non-uniform (and probably hierarchical) communication topology We believe that there are atleast two factors in play here

Fragmentation of SLURM node allocations When the system is busy SLURM allocations (seethe start of sect3) are much more fragmented For example the allocation for the experiments in Figure 32was

atcn[141144181-184189-198235-286289-306325-347353-360363-366378387-396467-468541-549577-592595-598602611-648665667-684701-726729734-735771-776]

whereas the allocation for Figure 23 was

atcn[055-072109-144199-216235-252271-306325-342433-450458465-467505-522541-594667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and thus that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


[Plot: execution time (s) against number of nodes for TL-ACO, ML-ACO and GR-ACO]

Figure 31: OTP 17.4 execution times (2), Athos cluster

[Plot: execution time (s) against number of nodes for TL-ACO, ML-ACO, GR-ACO and SR-ACO]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP 17.4 than Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.
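As an illustrative sketch (not the exact invocations used in the study), resource consumption can be sampled with these standard tools roughly as follows:

```shell
# Snapshot of per-process CPU and memory usage, in batch mode
# (one iteration), suitable for logging during a run.
top -b -n 1 | head -20

# Per-interface packet counters; taking one sample before and one
# after a run and differencing them gives the packets sent and
# received during the simulation.
netstat -i
```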

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4 × 1000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores) and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
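The speedup and efficiency figures quoted above follow directly from the runtimes in Figure 34; a minimal sketch of the calculation (the runtimes in minutes are approximate readings from the figure):

```erlang
%% Relative speedup and parallel efficiency with respect to a single
%% 16-core node. T1 and Tn are runtimes in minutes; the example values
%% are approximate readings from Figure 34.
-module(speedup_sketch).
-export([speedup/2, efficiency/3]).

speedup(T1, Tn) -> T1 / Tn.

%% Achieved speedup divided by the ideal (linear) speedup.
efficiency(T1, Tn, Nodes) -> speedup(T1, Tn) / Nodes.

%% speedup(1000, 290)        -> ~3.45 (16 nodes; cf. Figure 35)
%% efficiency(1000, 290, 16) -> ~0.22, i.e. ~22% of linear scaling
```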

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
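For reference, the two tunings discussed above can be combined on the command line roughly as follows (a sketch: the simulation's other arguments are elided, and the scheduler count assumes a 12-physical-core Athos host):

```shell
# Bind schedulers with the thread_no_node_processor_spread policy, and
# run 12 schedulers (one per physical core) rather than the default 24,
# which would also use the hyperthreaded logical cores.
erl +sbt tnnps +S 12:12
```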

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The number of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we can usefully apply techniques such as SD Erlang s groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a benchmark suite, developed by RELEASE, for measuring the scalability of applications written in Erlang. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code; and

• most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
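The plugin logic can be sketched as follows (a simplified reconstruction, not the actual plugin code; we assume Percept2 entry points of the form percept2:profile/2 and percept2:stop_profile/0, and the callback and option names are illustrative):

```erlang
%% Sketch of the Sim-Diasca Percept2 plugin: start tracing on every
%% computing node when the simulation starts, stop it when it ends.
-module(percept2_plugin_sketch).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Called when the simulation starts: begin tracing on each computing
%% node, one trace file per node, to be analysed separately afterwards.
on_simulation_start(ComputingNodes) ->
    lists:foreach(
      fun(Node) ->
              File = "percept2_" ++ atom_to_list(Node) ++ ".dat",
              rpc:call(Node, percept2, profile, [File, [all]])
      end,
      ComputingNodes).

%% Called when the simulation ends: stop tracing everywhere.
on_simulation_stop(ComputingNodes) ->
    lists:foreach(
      fun(Node) -> rpc:call(Node, percept2, stop_profile, []) end,
      ComputingNodes).
```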

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. The reason we selected this particular setup was that we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned, but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,          % Node family of the computing nodes
>     soda_benchmarking_test). % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which a node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the compute nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,            % Node family of the user node
>     soda_benchmarking_test,    % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s groups. This s group connection topology, and the associated message routing between s groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s groups, i.e. the s group of its parent and siblings, and also an s group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s group would provide a gateway with processes that route messages to other s groups.
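A minimal sketch of how such a hierarchy could be declared in SD Erlang follows. The s_group:new_s_group/2 primitive (taking an s group name and a list of nodes) is part of SD Erlang; the module, node and group names here are illustrative, and the snippet is an outline of the design, not an evaluated implementation.

```erlang
%% Outline of the s_group hierarchy of Figure 45, for a tree of height
%% two: one root time manager with children tm1 and tm2, where tm1 in
%% turn has children tm3 and tm4.
-module(tm_sgroups_sketch).
-export([setup/0]).

setup() ->
    %% s_group of the root time manager and its immediate children.
    _ = s_group:new_s_group(root_tm_group,
                            ['root@host0', 'tm1@host1', 'tm2@host2']),
    %% s_group of tm1 and the time managers below it. tm1 thus belongs
    %% to both groups, and its gateway processes route messages between
    %% them, as in the Multi-level ACO design.
    _ = s_group:new_s_group(tm1_group,
                            ['tm1@host1', 'tm3@host3', 'tm4@host4']),
    ok.
```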


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared the ground for removing the next bottlenecks to be encountered, and have identified some design patterns and good practices to adopt regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes of the Blue Gene/Q [SKN+13, Section 5.3 and Appendix]. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which must also be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address, that of their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's TCP/IP-based distribution mechanism, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a basename (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
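For illustration only, the name construction described above might look as follows in C; the real mpihelper module is written in Erlang, and the '@' separator is an assumption here (distributed Erlang node names have the form name@host):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch of building a node name from a base name, an MPI
 * index and a host name, e.g. "mpinode", 3, "host0" -> "mpinode3@host0".
 * The exact format used by the real mpihelper module may differ. */
static void make_node_name(char *out, size_t outlen,
                           const char *base, int mpi_index, const char *host)
{
    snprintf(out, outlen, "%s%d@%s", base, mpi_index, host);
}
```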

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data available.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to trigger tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
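Both output and control follow a parse-a-command-byte-and-dispatch pattern. The following C sketch illustrates that pattern only; the command byte values and the handler stubs are assumptions, not the actual mpi_dist protocol:

```c
#include <stddef.h>

/* Illustrative command bytes; the real driver protocol may differ. */
enum { CMD_LISTEN = 1, CMD_ACCEPT, CMD_CONNECT, CMD_SEND, CMD_RECEIVE };

/* Sketch of an output-style callback: parse the leading command byte and
 * relay the remainder of the buffer to the matching functionality.
 * Returns the command handled, or -1 for an empty or unknown buffer. */
static int dispatch_output(const unsigned char *buf, size_t len)
{
    if (len == 0)
        return -1;
    const unsigned char *payload = buf + 1;   /* remainder after command byte */
    size_t payload_len = len - 1;
    (void)payload; (void)payload_len;         /* handlers elided in this sketch */
    switch (buf[0]) {
    case CMD_LISTEN:    /* would put the port in listening mode */
    case CMD_ACCEPT:    /* would spawn a thread for the next connection */
    case CMD_CONNECT:   /* would connect to a remote acceptor port */
    case CMD_SEND:      /* would transmit the payload via MPI */
    case CMD_RECEIVE:   /* would receive data via MPI */
        return buf[0];
    default:
        return -1;
    }
}
```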

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
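For concreteness, the two ladders can be generated programmatically. In this hypothetical C sketch the step sizes (10 for the small experiments, 500 for the large ones, after an initial single-ant run) are inferred from the sequences listed above, not taken from the deliverable's scripts:

```c
#include <stddef.h>

/* Hypothetical sketch: generate the small (1, 10, 20, ..., 1000) or large
 * (1, 500, 1000, ..., 100000) ladder of ant counts used in the experiments.
 * Returns the number of experiment configurations generated. */
static size_t ant_counts(int large, int *out, size_t max)
{
    size_t n = 0;
    int step = large ? 500 : 10;
    int limit = large ? 100000 : 1000;
    if (n < max)
        out[n++] = 1;                      /* both ladders start at 1 ant */
    for (int ants = step; ants <= limit && n < max; ants += step)
        out[n++] = ants;
    return n;
}
```

Under these assumptions, each system runs 101 small and 201 large configurations.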

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran the experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which means that only one processing unit per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version, in comparison with OTP 17.4 (on which it is based), on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Conclusion We have shown the challenges of significant scalability studies: even simply being able to run non-trivial applications in larger settings is difficult, not to mention that the benchmarking tools, and our ability to synthesize the resulting data, must scale as well. We established some common architectural changes that distributed Erlang applications may adopt in order to alleviate some typical scalability bottlenecks.

Project funded under the European Community Framework 7 Programme (2011–14)

Dissemination Level
PU Public
PP Restricted to other programme participants (including the Commission Services)
RE Restricted to a group specified by the consortium (including the Commission Services)
CO Confidential, only for members of the consortium (including the Commission Services)


Scalability Case Studies: Scalable Sim-Diasca for the Blue Gene

Contents

1 Executive Summary 3

2 The main case study 5
  2.1 Sim-Diasca Overview 5
  2.2 City Example 6
    2.2.1 Overview of the simulation case 6
    2.2.2 Description of the simulated elements 6
    2.2.3 Additional changes done for benchmarking 11

3 Benchmarks 13
  3.1 Orbit 13
    3.1.1 Running Orbit on Athos 14
    3.1.2 Distributed Erlang Orbit 15
    3.1.3 SD Erlang Orbit 16
    3.1.4 Experimental Evaluation 17
    3.1.5 Results on Other Architectures 18
  3.2 Ant Colony Optimisation (ACO) 24
    3.2.1 ACO and SMTWTP 24
    3.2.2 Multi-colony approaches 24
    3.2.3 Evaluating Scalability 28
    3.2.4 Experimental Evaluation 29
  3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster 29
    3.3.1 Basic results 30
    3.3.2 Increasing the number of messages 32
    3.3.3 Some problematic results 32
    3.3.4 Network Traffic 38
  3.4 Summary 38

4 Measurements 40
  4.1 Distributed Scalability 40
    4.1.1 Performance 40
    4.1.2 Distributed Performance Analysis 41
    4.1.3 Discussion 44
  4.2 BenchErl 44
  4.3 Percept2 45

5 Experiments 50
  5.1 Deploying Sim-Diasca with WombatOAM 50
    5.1.1 The design of the implemented solution 50
    5.1.2 Deployment steps 51
  5.2 SD Erlang Integration 53

6 Implications and Future Work 55

A Porting Erlang/OTP to the Blue Gene/Q 56
  A.1 Basing Erlang/OTP's Distribution Mechanism on MPI 56
  A.2 MPI Driver Internals 57
  A.3 Current Status of the Blue Gene/Q Port 58

B Single-machine ACO performance on various architectures and Erlang/OTP releases 58
  B.1 Experimental parameters 59
  B.2 Discussion of results 61
    B.2.1 EDF Xeon machines 61
    B.2.2 Glasgow Xeon machines 61
    B.2.3 AMD machines 61
  B.3 Discussion 63


1 Executive Summary

The stated objectives of this deliverable are to "port Sim-Diasca to the Blue Gene architecture, adding locality control as required". In the event, we have interpreted this more generally, studying two benchmarks in addition to Sim-Diasca and measuring them on five parallel architectures (Section 3.1.5), as outlined below.

The deliverable aims to study the scalability of Erlang programs, in order to be able to process larger problem sizes while making good use of available computing resources. More precisely, the overall aim here is to study:

• How Erlang programs currently scale, using for that purpose large computing infrastructures like high performance clusters. We intended to investigate the performance on the Blue Gene/Q supercomputer but, as outlined in Appendix A, the corresponding port of the Erlang runtime was only partly functional, due to issues at the level of the networking back-end. Instead, we have used 5 conventional clusters.

• The extent to which the scalability of Erlang programs can be improved by adopting architectural changes and making various software choices. We compare the performance of the Erlang/OTP release that existed at the start of the project (R15B) with the version containing the RELEASE scalability improvements (17.4). We also measure the impact of using the SD Erlang version developed in the project.

To achieve these goals, the scalability of one main case study is studied, namely the discrete-time simulation engine Sim-Diasca (released as free software by EDF since 2010), whose purpose is to execute large simulations of complex systems (Section 2.1). For RELEASE, a full benchmarking simulation case named City-example has been devised and implemented by EDF (Section 2.2). It transpires that the Sim-Diasca City instance scales up to at least 16 hosts (256 cores) on two clusters, but exhibits poor efficiency on both (Section 4.1). We investigate the scalability issues using both standard profiling tools (Section 4.1) and the new RELEASE tools (Sections 4.2 and 4.3).

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups at around 8 hosts. These are not the network connectivity issues that emerge at scales of around 60 hosts in our Riak study [GCTM13] and are established in Erlang folklore. To investigate the issues at these larger scales, we re-use the Orbit benchmark (D3.1) and develop a new Ant Colony Optimisation (ACO) benchmark. We find that SD Erlang improves the performance of both ACO and Orbit beyond 60 hosts (Sections 3.1.4 and 3.3.4). Moreover, applications running on Erlang/OTP R15B, 17.4 (Official) and 17.4 (RELEASE) all exhibit similar scaling, e.g. similar speedup and runtime curves. However, the 17.4 versions have slightly smaller runtimes than R15B on AMD architectures (Appendix B), while the converse holds on Intel Xeon machines (Section 3.3).

We demonstrate the deployment and monitoring of Sim-Diasca using our new WombatOAM tool (Section 5). Although the Sim-Diasca City instance has not reached the 60-host scale where SD Erlang techniques can help, we present a preliminary design for applying SD Erlang to it (Section 5.2).

Partner Contributions EDF created the City-example simulation case based on Sim-Diasca (Section 2), provided access to the Blue Gene/Q and the Athos cluster, and, as lead participant, coordinated the case study. Glasgow designed and measured the ACO and Orbit benchmarks (Section 3), investigated the scalable performance of the City Sim-Diasca instance using conventional tools (Section 4.1), and proposed a design for incorporating SD Erlang into the design of Sim-Diasca (Section 5.2). Kent provided the Percept2 tool, applied to Sim-Diasca by ICCS in Section 4.3. ICCS and Uppsala also applied BenchErl to Sim-Diasca (Section 4.2). Uppsala worked on the port of Erlang to the Blue Gene/Q, which is outlined in Appendix A. ESL developed a version of Sim-Diasca whose deployment relied on


WombatOAM and demonstrated its deployment and monitoring with WombatOAM (Section 5). EAB provided the Erlang/OTP releases on which we base our performance measurements.


2 The main case study

2.1 Sim-Diasca Overview

Sim-Diasca stands for Simulation of Discrete Systems of All Scales. Sim-Diasca (http://www.sim-diasca.com) is a discrete-time simulation engine designed to be applied to large-scale complex systems. This engine is developed by EDF R&D, and has been released since 2010 as free software under the GNU LGPL licence.

Simulators tend to be sizable, if not massive: typical examples are large-scale information systems, smart metering infrastructures involving millions of interacting devices, full ecosystems, the operating components of utilities (energy, waste, etc.) at the scale of entire cities, etc. As long as a target system can be logically subdivided into (potentially very numerous) parts interacting over discrete time, chances are that it can be modelled according to Sim-Diasca's conventions and then simulated by this engine.

The overall objective of the engine is to evaluate correctly the models involved in a simulation, and for that to preserve key properties, like causality, total reproducibility, and some kind of "ergodicity" (a fair exploration of the possible outcomes of the simulation1).

Preserving these properties would not be a real problem if the size of the simulated systems remained within reasonable bounds. As this is by design hardly the case for most complex systems (extrapolating their behaviour based on scale models is hazardous at best), the engine had to be designed so that it can deal with up to millions of tightly interacting model instances. Such simulations cannot be evaluated unless major efforts are spent to make them as parallel as possible (so that they can make use of all the cores of all the processors of a computer) and distributed (so that a set of networked computers can be used to collectively run a single simulation). This is often needed to keep the simulation durations (in wall-clock time) below a threshold (which could not be met using at most one core of one processor, like many engines do) and to get access to enough memory (RAM) so that these simulations can exist at all.

Once these concurrent (i.e. parallel and distributed) operations can be properly expressed and organised, they still have to be implemented and effectively run on adequate processing resources, typically HPC2 clusters or supercomputers such as EDF's Blue Gene/Q.

So the central difficulty is to preserve the aforementioned properties despite massive concurrency and very significant problem sizes: scalability is surely at the heart of the Sim-Diasca use case. This is all the more a challenge as these discrete-time simulation engines are far from being embarrassingly parallel problems: one should not expect perfect speed-ups here, as many interleaved operations have to be finely synchronised by the engine so that all constraints are met; opening up any underlying potential concurrency thus comes at a cost.

More precisely, based on the requested simulation frequency, Sim-Diasca splits the simulated time into a series of time steps, automatically skipping the ones that can be jumped over, and reordering the inter-model messages so that properties like reproducibility are met. Causality resolution requires that time steps be further divided into as many logical moments (named diascas) as needed. During a given

1Even in the absence of stochastic models, concurrent events allow for multiple possible "licit" trajectories of the target system.

2Meaning High Performance Computing


diasca, all model instances that have to be scheduled will then be evaluated fully concurrently, but this massive parallelism can only happen between two (lightweight) distributed synchronisations3.

This demand for scalability, combined with the need to rely on HPC resources to evaluate such large simulations, makes the title of this deliverable, Scalable Sim-Diasca for the Blue Gene, quite self-explanatory.

2.2 City Example

2.2.1 Overview of the simulation case

The City Example simulation case has been designed to provide an open, sharable, tractable yet representative use case of Sim-Diasca for RELEASE's benchmarking purposes. Sim-Diasca is indeed a simulation engine, not a simulator; hence we need to define a simulation on top of it to create a benchmark.

The City example has been designed so that it is potentially arbitrarily scalable, both in terms of duration and size: there are no bounds to the duration in virtual time during which the target city can be evaluated (of course the wall-clock time will in turn reflect this), nor to its size, as this is a telescopic simulation case based on a target system (the city) that is, according to various consistency constraints, generated procedurally.

Hence the City example can be used to benchmark arbitrarily long and large simulations, reflecting the typical issues that many real-world simulations exhibit: sequential phases becoming acute problems, new bottlenecks appearing as the scale increases, each resource showing a criticality profile, etc.

2.2.2 Description of the simulated elements

This specific simulation attempts to represent a few traits of a city, i.e. the part that deals with waste management and the part that corresponds to the weather system above it.

The waste management system Before being simulated, an artificial city must be procedurally generated. For that, a number of waste sources (residential or industrial), incinerators and landfills are defined, and a road network (made of roads and road junctions) is generated to interconnect them.

A pool of waste trucks is then created and dispatched on the road network: they will each strive, at simulation time, to transport wastes (multiple kinds of them are defined) so that the garbage produced by the various waste sources is collected and then transformed in incinerators, resulting in bottom ash that is then to be transported farther, to landfills.

In a properly balanced system, none of the waste storage facilities will be saturated in the process, incinerators will be appropriately fed, and waste will not accumulate in the chain.

An example of a road network corresponding to such a city is represented in Figure 1.

The waste system of these cities thus includes the following elements:

- waste sources, which are either residential (they are numerous, each producing small quantities of various waste types) or industrial (there are a few of them, mostly producing large quantities of mostly other waste types);

- incinerators, each being able to burn some of these types of waste (the duration of this process depending on several factors, including which tank is used, the kind of waste, and the burners that are available for that), but producing in turn non-incinerable waste (bottom ashes);

- landfills, which are able to store all kinds of wastes (incinerable or not) but are not able to transform them;

3These synchronisations just operate so that a consensus on the next overall virtual timestamp is established


Figure 1 A tiny instance of generated road network

- waste trucks, which are able to transfer wastes from one point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), limited storage and possibilities of mixing wastes, and limited knowledge of their surroundings4;

- a road network, which allows vehicles (currently only waste trucks) to reach points of interest; this is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of vehicles on them, as shown in Figure 2); this network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS - currently not used in the course of the simulation, as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3.

This waste management system is not so trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep only a limited number of cores busy simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported5, and this showed indeed a level that,

4These disaggregated, individual-based simulations rely only upon decentralised, partial information: for example, no agent - except, before the simulation starts, the mini-GIS - has a total knowledge of the road network (which, during the simulation, does not exist as such for scalability reasons - it is merely an implicit graph).

5Reporting the diasca count has had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead to the list of their outgoing roads being permuted in some cases, which could lead in turn, far later in the simulations, to waste trucks making different


Figure 2 Vehicle speed based on the load of a road

Figure 3 Main classes and models of interest for the waste management system


Figure 4 Phases of a few weather cells recreating Lorenz's strange attractor

once converted into a (lower) actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot - and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically, thanks to a fourth-order Runge-Kutta method. It is additionally perturbed by its neighbours, as adjacent cells influence each other.
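For illustration, the numerical scheme used by a single cell can be sketched as follows. This is a minimal Python version (the engine and its models are written in Erlang); the Lorenz parameters, step size and initial state are illustrative assumptions, not values taken from the City-example models, and neighbour coupling is omitted.

```python
import math

# Classic Lorenz system; sigma/rho/beta and the step size are illustrative.
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def lorenz(state):
    x, y, z = state
    return (SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z)

def rk4_step(state, h):
    """One fourth-order Runge-Kutta step of size h."""
    def add(s, k, f):  # componentwise s + f * k
        return tuple(si + f * ki for si, ki in zip(s, k))
    k1 = lorenz(state)
    k2 = lorenz(add(state, k1, h / 2))
    k3 = lorenz(add(state, k2, h / 2))
    k4 = lorenz(add(state, k3, h))
    return tuple(s + (h / 6) * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# A cell would integrate its local state one step per simulation tick.
state = (1.0, 1.0, 1.0)
for _ in range(1000):
    state = rk4_step(state, 0.01)
```

The chaotic yet bounded trajectories this produces are what makes the load homogeneous in time: every cell does the same amount of arithmetic at every tick, whatever its state.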

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5 Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control processing demand independently of the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and that can moreover be finely and easily tuned in terms of respective resource consumption.

If we had to estimate the resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations become able to run. They will most probably start by being CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then become increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second being processing), but will progressively scatter the interacting instances more and more across the hosts6 - thus increasingly replacing local communications with networked ones, and slowing down the

6Even with a smart load balancer, the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration increased very quickly as the scale grew - notably because of the embedded mini-GIS7, which was operating sequentially and whose load was growing exponentially with the number of spatialised instances to manage8.

Efforts were made in order to remove that GIS bottleneck and make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

- a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

- a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be read directly from a pre-established file. While the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even if it was made largely parallel.

The last changes that were made dealt with the integration of third-party tools with Sim-Diasca, like BenchErl and Percept2.

Distributed applications like Sim-Diasca have, of course, their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment on its own instead; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

While an ad hoc solution for the BenchErl integration could finally be devised, not only did deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which nodes were elected, and notify it when each simulation phase began or finished (e.g. monitoring the

7GIS stands for Geographic Information System.

8The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances; otherwise the shorter roads would lead to traffic durations brief to the point of inducing, when quantised over the simulation time-step, a relative error above the default threshold allowed by the engine; Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading may not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings are discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables controlling locality and reducing connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model, nodes are grouped into a number of s_groups: nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global name space; instead, every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project, and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found here: https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM The benchmarks we present in this section are run on the Athos cluster, located at EDF, France. Athos has 776 compute nodes, called atcn001-atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node, and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that the node names appear not to correspond exactly to the physical structure of the cluster: see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0,X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0,X]. This creates new numbers x1, ..., xn ∈ [0,X]. The generator functions are applied


Figure 6 SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

- It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

- It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

- It is only a few hundred lines of code, and has good performance and extensibility.
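The fixed-point computation described above can be sketched sequentially. This is a minimal Python illustration (the benchmark itself is written in Erlang); the space size and generator functions are illustrative assumptions, and a comment marks where D-Orbit's hash-based partitioning of the table would come in.

```python
# Sequential sketch of the Orbit computation: apply each generator to
# every newly found vertex until no new vertex appears (a transitive
# closure / fixed point). X and the generators are illustrative.
X = 100_000  # size of the space [0, X)
generators = [lambda x: (x * x + 1) % X,
              lambda x: (3 * x + 7) % X]

def orbit(x0):
    seen = {x0}          # plays the role of the distributed hash table
    frontier = [x0]
    while frontier:
        new = []
        for x in frontier:
            for g in generators:
                y = g(x)
                if y not in seen:   # in D-Orbit: hash(y) picks the owner
                    seen.add(y)
                    new.append(y)
        frontier = new
    return seen

vertices = orbit(42)
```

In the distributed version, the `seen` set is split across worker processes: each newly generated vertex is hashed to find its owning worker, which checks membership locally and continues the exploration.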

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and qos=release is the RELEASE project quota that enables requesting up to 256 Athos hosts.

To run the experiments we need to define parameters in the run-slurm script (Figure 7)


FROMNUMNODES is the minimum number of nodes on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment is run.

Figure 7 Parameters in run-slurm

Figure 8 Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters as follows: $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
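Our reading of how these parameters combine can be sketched as follows; this Python fragment only illustrates the schedule they induce (it is not code from the run-slurm script, and the function name is ours).

```python
# Sketch of the node-count schedule derived from the run-slurm parameters
# (names taken from Figure 7); the logic is our reading of the script.
def node_schedule(from_num_nodes, step_nodes, allocated, num_repeat):
    """Return (node_count, repetition) pairs for each run."""
    counts = range(from_num_nodes, allocated + 1, step_nodes)
    return [(n, rep) for n in counts for rep in range(num_repeat)]

# With 10 allocated nodes, FROMNUMNODES=4, STEPNODES=3, NUMREPEAT=2:
runs = node_schedule(4, 3, 10, 2)
# runs on 4, 7 and 10 nodes, twice each
```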

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; then we run the experiment using the timetest script, and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench_dist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other, and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find in which part of the hash table this number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process collects the credit, it can detect whether the computation has finished.
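The credit idea can be sketched as follows. This is a deliberately simplified, sequential Python illustration of the invariant behind such algorithms (total credit is conserved; full recovery by the master signals termination), not the actual distributed protocol used by Orbit; the spawn probability and seed are arbitrary.

```python
from fractions import Fraction
import random

def simulate(n_initial, seed=0):
    """Credit-conservation sketch: the initial credit 1 is split among
    the first tasks; an active task may spawn a child and hand it half
    its credit, and a task becoming passive returns its credit to the
    master. All credit recovered <=> the computation has terminated."""
    rng = random.Random(seed)
    recovered = Fraction(0)                    # credit back at the master
    active = [Fraction(1, n_initial)] * n_initial
    while active:
        credit = active.pop()
        if rng.random() < 0.3:                 # task spawns a child
            active.append(credit / 2)          # half the credit moves
            active.append(credit / 2)
        else:                                  # task becomes passive
            recovered += credit
    return recovered

recovered = simulate(5)
```

Using exact rationals makes the conservation property visible: however the credit is split, the master recovers exactly 1 when, and only when, no active task remains.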

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/ResearchBenchmarks/


Figure 9 D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code

Parameters In the experiments we discuss in Section 314 we use the following parameters

- The Orbit generator is bench:g12345/1.

- We run experiments for the following initial Orbit space sizes: 2*10^6, 3*10^6, 4*10^6, and 5*10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, and repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into a set of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10 Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N - 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M - 1) TCP connections, and a sub-master node has (M - 1 + (N - 1)/M) connections.
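Plugging numbers into these counts shows the scale of the connectivity saving; a small Python sketch (the cluster and s_group sizes chosen here are illustrative):

```python
# Per-node TCP connection counts, following the counts stated above:
# N nodes in total, each worker s_group containing M nodes (the
# sub-master figure uses the text's approximation for the number of
# sub-master nodes in the master s_group).
def d_orbit_worker(n):
    return n - 1                          # fully connected: all other nodes

def sd_worker(m):
    return m - 1                          # only the nodes of its s_group

def sd_submaster(n, m):
    return (m - 1) + (n - 1) / m          # own s_group + master s_group

# e.g. 256 nodes in total, worker s_groups of 10 nodes:
full = d_orbit_worker(256)    # 255 connections per worker in D-Orbit
grouped = sd_worker(10)       # 9 connections per worker in SD-Orbit
```

Even the most heavily connected SD-Orbit node (a sub-master) holds far fewer connections than an ordinary D-Orbit worker, which is where the scalability gain at large node counts comes from.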

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/ResearchBenchmarks/scalability-measurements/Orbit/sd-orbit-code

Parameters In addition to the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

- Sub-master nodes are on separate Athos hosts from worker nodes.

- Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we have chosen 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No | Name     | Location | Hosts | Cores per host | Total cores | Max cores | Wait time | Processor                | RAM  | Distributed Erlang Port
1  | GPG      | GLA      | 20    | 16             | 320         | 320       | 0         | Xeon E5-2640 v2, 2GHz    | -    | Yes
2  | TinTin   | Uppsala  | 160   | 16             | 2560        | -         | -         | -                        | -    | Yes
3  | Kalkyl   | Uppsala  | -     | 8              | -           | -         | varies    | -                        | -    | Yes
4  | Athos    | EDF      | 776   | 24             | 18624       | 6144      | varies    | Xeon E5-2697 v2, 2.7GHz  | 64GB | Yes
5  | Zumbrota | EDF      | 4096  | 16             | 65536       | -         | 17hrs     | Blue Gene/Q (PowerPC A2) | -    | No

Table 1 Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading. However, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows, SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak, the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and then human involvement is required to restart the hosts. Due to the way SLURM works, a user is not immediately informed of the reasons for the failures, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put approximately 157 Athos hosts out of action, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This holds for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.

ICT-287510 (RELEASE) 23rd December 2015 19

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4 ((a) Runtime; (b) Speedup)


Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)


Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster ((a) Runtime; (b) Speedup)


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.
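The update rule just described can be sketched as follows. This is an illustrative reconstruction rather than the project's actual code: the matrix is held as a list of rows, the best solution as a list of {Job, Position} pairs, and Rho (decay) and Delta (reinforcement) are hypothetical parameters.

```erlang
-module(pheromone).
-export([update/3]).

%% Sketch of a pheromone update: every entry of the N x N matrix decays
%% by a factor (1 - Rho); entries chosen by the best solution receive a
%% bonus Delta, steering later ants towards those choices.
update(P, Best, {Rho, Delta}) ->
    N = length(P),
    [[new_value({I, J}, V, Best, Rho, Delta)
      || {J, V} <- lists:zip(lists:seq(1, N), Row)]
     || {I, Row} <- lists:zip(lists:seq(1, N), P)].

new_value(Entry, V, Best, Rho, Delta) ->
    Decayed = (1.0 - Rho) * V,
    case lists:member(Entry, Best) of
        true  -> Decayed + Delta;  %% reinforce the best solution's choices
        false -> Decayed
    end.
```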

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
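The master/ant interaction can be sketched like this. The module and function names are illustrative, and Construct stands in for the heuristic solution construction, which in the real application reads the pheromone ETS table:

```erlang
-module(colony).
-export([generation/2]).

%% One generation: spawn NumAnts ant processes, each constructing a
%% candidate {Cost, Schedule} solution and sending it back; the caller
%% plays the role of the master, collecting all results and keeping the
%% cheapest one (tuples sort by their first element, the cost).
generation(NumAnts, Construct) ->
    Master = self(),
    [spawn(fun() -> Master ! {ant_done, Construct()} end)
     || _ <- lists:seq(1, NumAnts)],
    Solutions = [receive {ant_done, Sol} -> Sol end
                 || _ <- lists:seq(1, NumAnts)],
    hd(lists:sort(Solutions)).
```

In the real application the master would go on to update the pheromone table with the winning schedule before starting the next generation.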

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO (a master process, N_C colony nodes, and N_A ant processes per colony node)

their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with N_C nodes. The master process spawns N_C colony processes on the available nodes; in the next step, each colony process spawns N_A ant processes on its local node. In the figure, objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are I_M communications between the master process and a colony process, and I_A bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying the inequality


Figure 17: Node Placement in Multi-Level Distributed ACO (level 0: master process; levels 1 to N-1: sub-master nodes; level N: colony nodes only)


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150) and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
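The relation between processes, nodes, and levels given for ML-ACO above can be computed mechanically. The sketch below (illustrative names, integer arithmetic) returns the largest X satisfying the inequality, together with the node count a given tree depth consumes:

```erlang
-module(ml_tree).
-export([levels/2, nodes_used/2]).

%% Largest X such that 1 + P + P^2 + ... + P^(X-2) + P^X =< N, where P
%% is the number of processes per sub-master node and N the number of
%% available nodes.
levels(P, N) -> largest(P, N, 1).

largest(P, N, X) ->
    case nodes_used(P, X + 1) =< N of
        true  -> largest(P, N, X + 1);
        false -> X
    end.

%% Nodes used by a tree with X levels: the sub-master levels 0..X-2
%% contribute P^0 + ... + P^(X-2) nodes; the colony level contributes P^X.
nodes_used(P, X) ->
    lists:sum([ipow(P, I) || I <- lists:seq(0, X - 2)]) + ipow(P, X).

ipow(_, 0) -> 1;
ipow(P, K) when K > 0 -> P * ipow(P, K - 1).
```

With P = 5 and N = 150 this reproduces the worked example above: 3 levels using 131 of the 150 nodes.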

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


Figure 19: Mean Error (mean error in % against number of colonies, 1 to 256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Execution time (mean execution time in seconds against number of colonies, 1 to 256)

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but this is typically only about 2-3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
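A deterministic stand-in for the random number generator, as used above to improve reproducibility, can be as simple as cycling through a fixed list. This is an illustrative sketch, not the project's actual code:

```erlang
-module(cyclic).
-export([new/1, next/1]).

%% Deterministic replacement for a random number generator: returns the
%% numbers of a fixed, non-empty list in a cycle, so that repeated runs
%% with the same input behave identically.
new(Numbers) when Numbers =/= [] -> {Numbers, Numbers}.

next({[], All})        -> next({All, All});       %% wrap around
next({[N | Rest], All}) -> {N, {Rest, All}}.      %% emit next number
```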

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
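timer:tc/1 runs a nullary fun and returns the elapsed wall-clock time in microseconds together with the fun's result. A minimal wrapper in the style used here might look as follows (the module and function names are illustrative):

```erlang
-module(timing).
-export([timed/1]).

%% Wrap a computation with timer:tc/1 and convert the elapsed time from
%% microseconds to seconds, returning {Seconds, Result}.
timed(Fun) ->
    {MicroSecs, Result} = timer:tc(Fun),
    {MicroSecs / 1.0e6, Result}.
```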

3.3.1 Basic results

Figures 21-23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and that the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24-26 show how the performance of each ACO version varies depending

on the Erlang version and, as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27-29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21-23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 25: ML-ACO execution times, Athos cluster (execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))


Figure 26: GR-ACO execution times, Athos cluster (execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 27: R15B execution times, messages ×500 (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO)


Figure 28: OTP 17.4 execution times, messages ×500 (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500 (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30-32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was:

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO ((a) Number of Sent Packets; (b) Number of Received Packets)


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed at that scale in order to boost the overall concurrency level by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4 × 1000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), we reach only 2.2 on 4 nodes (64 cores), and the speedup thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
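Together, the two settings discussed above amount to launching each VM along the following lines (a hypothetical invocation sketch; all application-specific arguments are omitted):

```shell
# Bind schedulers using the thread_no_node_processor_spread policy, and
# start 12 schedulers with 12 online, matching the 12 physical cores of
# an Athos host rather than the 24 hyperthreaded logical cores.
erl +sbt tnnps +S 12:12
```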

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems when running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows. To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and in the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for Erlang applications, developed by RELEASE. BenchErl was used with Sim-Diasca in order

ICT-287510 (RELEASE) 23rd December 2015 45

to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

We therefore moved all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevented Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.
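Such a scheduler sweep can be approximated outside BenchErl with a few lines of Erlang. The sketch below is purely illustrative (the module name and the busy workload are ours, not BenchErl's): it times a CPU-bound task while varying the number of online schedulers via erlang:system_flag/2.

```erlang
%% Illustrative sketch (not BenchErl itself): time a CPU-bound workload
%% while varying the number of online schedulers.
-module(sched_sweep).
-export([run/0]).

run() ->
    Max = erlang:system_info(schedulers),
    [time_with(N) || N <- lists:takewhile(fun(N) -> N =< Max end,
                                          [1, 2, 4, 8, 16, 32, 64])].

time_with(N) ->
    erlang:system_flag(schedulers_online, N),
    {Micros, _} = timer:tc(fun work/0),
    {N, Micros}.                               % {schedulers, runtime in us}

work() ->
    %% spawn 64 busy processes and wait for them all to finish
    Self = self(),
    Pids = [spawn(fun() -> busy(200000), Self ! done end)
            || _ <- lists:seq(1, 64)],
    [receive done -> ok end || _ <- Pids].

busy(0) -> ok;
busy(N) -> busy(N - 1).
```

On a hyperthreaded NUMA machine such as the one above, one would expect the measured times to stop improving beyond the physical-core count, mirroring the BenchErl observations.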

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
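The start/stop logic of such a plugin can be sketched as follows. The callback names are ours, and the percept2:profile/2 and percept2:stop_profile/0 shapes shown are assumptions to be checked against the Percept2 documentation; only the rpc-per-node structure reflects what the text above describes.

```erlang
%% Sketch of the plugin's start/stop actions (simplified; the real plugin
%% is driven by Sim-Diasca's plugin callbacks, which are omitted here).
-module(percept2_plugin_sketch).
-export([on_simulation_start/1, on_simulation_stop/1]).

on_simulation_start(ComputingNodes) ->
    %% start profiling on every computing node, one trace file per node
    [rpc:call(Node, percept2, profile,
              ["percept2_" ++ atom_to_list(Node) ++ ".dat", [all]])
     || Node <- ComputingNodes].

on_simulation_stop(ComputingNodes) ->
    %% stop profiling everywhere once the simulation has ended
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes].
```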

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'. (a) Execution time; (b) Speedup.


Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'. (a) Execution time; (b) Speedup.


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyze (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances are runnable but not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca-WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
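A minimal sketch of this two-level grouping, assuming SD Erlang's s_group:new_s_group/2 (which creates a named s_group over a list of nodes); all node names below are hypothetical, and this design has not been implemented in Sim-Diasca:

```erlang
%% Hypothetical sketch of the proposed s_group partitioning of the
%% time manager tree; node names are illustrative only.
-module(tm_sgroup_sketch).
-export([create_groups/0]).

create_groups() ->
    %% the root time manager forms an s_group with its child managers ...
    _ = s_group:new_s_group(tm_root_children,
                            ['root_tm@host0', 'tm1@host1', 'tm2@host2']),
    %% ... and each non-root manager forms a second s_group with its own
    %% children, so every manager belongs to at most two s_groups
    _ = s_group:new_s_group(tm1_children,
                            ['tm1@host1', 'tm1a@host3', 'tm1b@host4']),
    ok.
```

Gateway processes registered inside each s_group would then forward inter-group traffic, mirroring the Multi-level ACO routing scheme.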


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

By improving our knowledge of these applications and the scalability issues they experience, we were able to anticipate the next bottlenecks to be encountered, and to promote some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
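The general shape of such a spin loop can be illustrated with a small wrapper; this is a sketch of the idea, not the actual Erlang/OTP patch:

```c
#include <errno.h>
#include <unistd.h>

/* Sketch of a spin-loop wrapper around read(): retry until the call
 * makes progress instead of blocking, as required on CNK.  (The real
 * OTP workaround is applied inside the runtime system, not here.) */
ssize_t spin_read(int fd, void *buf, size_t count)
{
    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0)
            return n;   /* success, or end-of-file when n == 0 */
        if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;  /* a real error: give up */
        /* otherwise spin: retry the system call */
    }
}
```

An analogous wrapper is needed for write(), and the socket pairs replacing pipe() must be accessed the same way.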

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling from


Erlang net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a basename (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
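One plausible shape for the name construction in startup/1 is sketched below; this is illustrative only, as the real module also initialises the MPI environment and exchanges messages between all node pairs:

```erlang
%% Illustrative sketch of how mpihelper:startup/1 could build the node
%% name (basename ++ MPI index ++ hostname) and hand it to net_kernel.
-module(mpihelper_sketch).
-export([startup/1]).

startup(BaseName) ->
    Index = mpihelper:get_index(),        % unique MPI rank of this node
    {ok, Host} = inet:gethostname(),
    Name = list_to_atom(BaseName ++ integer_to_list(Index) ++ "@" ++ Host),
    net_kernel:start([Name, shortnames]).
```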

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
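For reference, driver callbacks of this kind are typically registered through an ErlDrvEntry table, following the naming in OTP's erl_driver.h. The fragment below is a sketch only: the mpi_drv_* names are hypothetical stand-ins, not the driver's actual code.

```c
#include "erl_driver.h"

/* Hypothetical forward declarations for the callbacks described above. */
static int          mpi_drv_init(void);
static ErlDrvData   mpi_drv_start(ErlDrvPort port, char *command);
static void         mpi_drv_stop(ErlDrvData d);
static void         mpi_drv_output(ErlDrvData d, char *buf, ErlDrvSizeT len);
static void         mpi_drv_ready_input(ErlDrvData d, ErlDrvEvent ev);
static void         mpi_drv_ready_output(ErlDrvData d, ErlDrvEvent ev);
static void         mpi_drv_finish(void);
static ErlDrvSSizeT mpi_drv_control(ErlDrvData d, unsigned int cmd,
                                    char *buf, ErlDrvSizeT len,
                                    char **rbuf, ErlDrvSizeT rlen);

static ErlDrvEntry mpi_driver_entry = {
    .init         = mpi_drv_init,         /* set up MPI, data structures */
    .start        = mpi_drv_start,        /* per-port initialisation     */
    .stop         = mpi_drv_stop,         /* port closed (unsupported)   */
    .output       = mpi_drv_output,       /* command mode: listen, accept,
                                             connect, send, receive      */
    .ready_input  = mpi_drv_ready_input,  /* incoming data available     */
    .ready_output = mpi_drv_ready_output, /* flush buffered data         */
    .driver_name  = "mpi_drv",
    .finish       = mpi_drv_finish,       /* driver unload (unsupported) */
    .control      = mpi_drv_control       /* control mode: statistics,
                                             ticks, mode switches        */
};
```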

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.

ICT-287510 (RELEASE) 23rd December 2015 59

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
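The experiment grids above are easy to reproduce mechanically. The sketch below (illustrative only; run_aco-style timings are made up) builds the two ant-count schedules and averages 5 runs per point, as done for the graphs.

```python
# Build the two experiment schedules and average repeated runs
# (illustrative sketch; fake_run and its timings are invented).

small = [1] + list(range(10, 1001, 10))      # 1, 10, 20, 30, ..., 1000
large = [1] + list(range(500, 100001, 500))  # 1, 500, 1000, ..., 100000

def mean_time(run, n_ants, repetitions=5):
    """Mean execution time of run(n_ants) over several repetitions."""
    return sum(run(n_ants) for _ in range(repetitions)) / repetitions

# A stand-in for the real benchmark: pretend time grows with ant count.
fake_run = lambda n: 0.05 + 0.0007 * n
points = [(n, mean_time(fake_run, n)) for n in small]
```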


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.
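The "roughly constant ratio" observation can be checked mechanically: for each ant count, divide the OTP 17.4 time by the R15B time and inspect the spread of the ratios. A sketch with invented numbers (the real data come from the measurements behind Figures 46 and 47):

```python
# Check whether two timing series differ by a roughly constant factor
# (sketch; the timing values below are invented for illustration).

def relative_overhead(times_new, times_old):
    """Per-point ratios and the mean percentage overhead of new vs old."""
    ratios = [n / o for n, o in zip(times_new, times_old)]
    mean_pct = 100.0 * (sum(ratios) / len(ratios) - 1.0)
    return ratios, mean_pct

r15b   = [0.10, 0.20, 0.40, 0.80]     # hypothetical R15B times (s)
otp174 = [t * 1.126 for t in r15b]    # ~12.6% slower, as observed
ratios, pct = relative_overhead(otp174, r15b)
print(round(pct, 1))  # → 12.6
```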

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.
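Restricting a process to the even-numbered CPUs (one hardware thread per physical core, assuming the common layout where CPUs 2k and 2k+1 share a core) can be done from the operating-system side. This Linux-only sketch uses Python's os.sched_setaffinity as an illustration; the VM could equally be launched under an external tool such as taskset, or bound via the emulator's scheduler-binding flags.

```python
# Linux-only sketch: restrict the current process to even-numbered CPUs,
# i.e. one hardware thread per hyperthreaded core (assumes sibling
# hardware threads are numbered 2k and 2k+1; check /proc/cpuinfo first).
import os

def pin_to_even_cpus():
    available = os.sched_getaffinity(0)       # CPUs we may use right now
    even = {cpu for cpu in available if cpu % 2 == 0} or available
    os.sched_setaffinity(0, even)             # drop the odd-numbered siblings
    return sorted(even)

print(pin_to_even_cpus())
```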

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Scalability Case Studies: Scalable Sim-Diasca for the Blue Gene

Contents

1 Executive Summary 3
2 The main case study 5
  2.1 Sim-Diasca Overview 5
  2.2 City Example 6
    2.2.1 Overview of the simulation case 6
    2.2.2 Description of the simulated elements 6
    2.2.3 Additional changes done for benchmarking 11
3 Benchmarks 13
  3.1 Orbit 13
    3.1.1 Running Orbit on Athos 14
    3.1.2 Distributed Erlang Orbit 15
    3.1.3 SD Erlang Orbit 16
    3.1.4 Experimental Evaluation 17
    3.1.5 Results on Other Architectures 18
  3.2 Ant Colony Optimisation (ACO) 24
    3.2.1 ACO and SMTWTP 24
    3.2.2 Multi-colony approaches 24
    3.2.3 Evaluating Scalability 28
    3.2.4 Experimental Evaluation 29
  3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster 29
    3.3.1 Basic results 30
    3.3.2 Increasing the number of messages 32
    3.3.3 Some problematic results 32
    3.3.4 Network Traffic 38
  3.4 Summary 38
4 Measurements 40
  4.1 Distributed Scalability 40
    4.1.1 Performance 40
    4.1.2 Distributed Performance Analysis 41
    4.1.3 Discussion 44
  4.2 BenchErl 44
  4.3 Percept2 45
5 Experiments 50
  5.1 Deploying Sim-Diasca with WombatOAM 50
    5.1.1 The design of the implemented solution 50
    5.1.2 Deployment steps 51
  5.2 SD Erlang Integration 53
6 Implications and Future Work 55
A Porting Erlang/OTP to the Blue Gene/Q 56
  A.1 Basing Erlang/OTP's Distribution Mechanism on MPI 56
  A.2 MPI Driver Internals 57
  A.3 Current Status of the Blue Gene/Q Port 58
B Single-machine ACO performance on various architectures and Erlang/OTP releases 58
  B.1 Experimental parameters 59
  B.2 Discussion of results 61
    B.2.1 EDF Xeon machines 61
    B.2.2 Glasgow Xeon machines 61
    B.2.3 AMD machines 61
  B.3 Discussion 63


1 Executive Summary

The stated objectives of this deliverable are to "port Sim-Diasca to the Blue Gene architecture, adding locality control as required". In the event we have interpreted this more generally, studying two benchmarks in addition to Sim-Diasca, and measuring them on five parallel architectures (Section 3.1.5), as outlined below.

The deliverable aims to study the scalability of Erlang programs, in order to be able to process larger problem sizes while making good use of available computing resources. More precisely, the overall aim here is to study:

• How Erlang programs currently scale, using for that large computing infrastructures like high performance clusters. We intended to investigate the performance on the Blue Gene/Q supercomputer, but as outlined in Appendix A the corresponding port of the Erlang runtime was only partly functional, due to issues at the level of the networking back-end. Instead, we have used 5 conventional clusters.

• The extent to which the scalability of Erlang programs can be improved by adopting architectural changes and making various software choices. We compare the performance of the Erlang/OTP release that existed at the start of the project (R15B) with the version containing the RELEASE scalability improvements (17.4). We also measure the impact of using the SD Erlang version developed in the project.

To achieve these goals, the scalability of one main case study will be studied, namely the discrete-time simulation engine named Sim-Diasca (released as free software by EDF since 2010), whose purpose is to execute large simulations of complex systems (Section 2.1). For RELEASE, a full benchmarking simulation case named City-example has been devised and implemented by EDF (Section 2.2). It transpires that the Sim-Diasca City instance scales up to at least 16 hosts (256 cores) on two clusters, but exhibits poor efficiency on both (Section 4.1). We investigate the scalability issues using both standard profiling tools (Section 4.1) and the new RELEASE tools (Sections 4.2 and 4.3).

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 hosts. These are not the network connectivity issues that emerge at scales of around 60 hosts in our Riak study [GCTM13] and are established in Erlang folklore. To investigate the issues at these larger scales we re-use the Orbit benchmark (D3.1) and develop a new Ant Colony Optimisation (ACO) benchmark. We find that SD Erlang improves the performance of both ACO and Orbit beyond 60 hosts (Sections 3.1.4 and 3.3.4). Moreover, applications running on Erlang/OTP R15B, 17.4 (Official) and 17.4 (RELEASE) all exhibit similar scaling, e.g. similar speedup and runtime curves. However, the 17.4 versions have slightly smaller runtimes than R15B on AMD architectures (Appendix B), while the converse holds on Intel Xeon machines (Section 3.3).

We demonstrate the deployment and monitoring of Sim-Diasca using our new WombatOAM tool (Section 5). Although the Sim-Diasca City instance has not reached the 60-host scale where SD Erlang techniques can help, we present a preliminary design for applying SD Erlang to it (Section 5.2).

Partner Contributions: EDF created the City-example simulation case based on Sim-Diasca (Section 2), provided access to the Blue Gene/Q and Athos cluster, and, as lead participant, coordinated the case study. Glasgow designed and measured the ACO and Orbit benchmarks (Section 3), investigated the scalable performance of the City Sim-Diasca instance using conventional tools (Section 4.1), and proposed a design for incorporating SD Erlang into the design of Sim-Diasca (Section 5.2). Kent provided the Percept2 tool, applied to Sim-Diasca by ICCS in Section 4.3. ICCS and Uppsala also applied BenchErl to Sim-Diasca (Section 4.2). Uppsala worked on the port of Erlang to the Blue Gene/Q, which is outlined in Appendix A. ESL developed a version of Sim-Diasca whose deployment relied on


WombatOAM, and demonstrated its deployment and monitoring with WombatOAM (Section 5). EAB provided the Erlang/OTP releases on which we base our performance measurements.


2 The main case study

2.1 Sim-Diasca Overview

Sim-Diasca stands for Simulation of Discrete Systems of All Scales. Sim-Diasca (http://www.sim-diasca.com) is a discrete-time simulation engine designed to be applied to large-scale complex systems. This engine is developed by EDF R&D, and it has been released since 2010 as free software under the GNU LGPL licence.

Simulators tend to be sizable, if not massive, and typical examples are large-scale information systems, smart metering infrastructures involving millions of interacting devices, full ecosystems, the operating components of utilities (energy, waste, etc.) at the scale of entire cities, etc. As long as a target system can be logically subdivided into (potentially very numerous) parts interacting over discrete time, chances are that it can be modelled according to Sim-Diasca's conventions and then simulated by this engine.

The overall objective of the engine is to evaluate correctly the models involved in a simulation, and, for that, to preserve key properties, like causality, total reproducibility, and some kind of "ergodicity" (a fair exploration of the possible outcomes of the simulation1).

Preserving these properties would not be a real problem if the size of the simulated systems remained within reasonable bounds. As this is by design hardly the case for most complex systems (as extrapolating their behaviour based on scale models is hazardous at best), the engine had to be designed so that it can deal with up to millions of tightly interacting model instances. Such simulations cannot be evaluated unless major efforts are spent so that they are as much as possible parallel (they can make use of all the cores of all the processors of a computer) and distributed (a set of networked computers can be used in order to collectively run that single simulation). This is often needed to keep the simulation durations (in wall-clock time) below a threshold (that could not be met if using at most one core of one processor, like many engines do), and to get access to enough memory (RAM) so that these simulations can exist at all.

Once these concurrent (i.e. parallel and distributed) operations can be properly expressed and organised, they still have to be implemented and be effectively run on actual, adequate processing resources, typically HPC2 clusters or supercomputers such as EDF's Blue Gene/Q.

So the central difficulty is to preserve the aforementioned properties despite massive concurrency and very significant problem sizes: scalability is surely at the heart of the Sim-Diasca use case. This is all the more a challenge as these discrete-time simulation engines are far from being embarrassingly parallel problems: one should not expect to see perfect speed-ups here, as many interleaved operations have to be finely synchronised by the engine so that all constraints are met; opening up any underlying potential concurrency thus comes at a cost.

More precisely, based on the requested simulation frequency, Sim-Diasca splits the simulated time into a series of time steps, automatically skipping the ones that can be jumped over, and reordering the inter-model messages so that properties like reproducibility are met. Causality resolution requires that time steps be further divided into as many logical moments (named diascas) as needed. During a given

1 Even in the absence of stochastic models, concurrent events allow for multiple possible "licit" trajectories of the target system.

2 Meaning High Performance Computing.


diasca, all model instances that have to be scheduled will then be evaluated fully concurrently, but this massive parallelism can only happen between two (lightweight) distributed synchronisations3.
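The tick/diasca scheduling just described can be summarised by a toy loop: within a time step, a new diasca is created for as long as actor messages remain pending, and all actors scheduled at a diasca are evaluated concurrently between two synchronisations. The following Python sketch is a conceptual model only, not Sim-Diasca's actual (Erlang) implementation.

```python
# Toy model of Sim-Diasca's tick/diasca scheduling (conceptual sketch
# only; the engine's real API and data structures differ).
from dataclasses import dataclass

@dataclass
class Message:
    target: "Actor"

class Actor:
    def __init__(self, sends):
        self.sends = sends          # messages to emit the first time we act

    def act(self, diasca):
        out, self.sends = self.sends, []
        return out                  # messages sent here force one more diasca

def run_tick(pending):
    """Evaluate one time step; return how many diascas it required."""
    diascas = 0
    while pending:                  # a new diasca exists while messages pend
        scheduled = {m.target for m in pending}
        nxt = []
        for actor in scheduled:     # conceptually evaluated concurrently
            nxt.extend(actor.act(diascas))
        # (lightweight) synchronisation: consensus on the next moment
        pending, diascas = nxt, diascas + 1
    return diascas

b = Actor([]); a = Actor([Message(b)])
print(run_tick([Message(a)]))  # a acts at diasca 0, b at diasca 1 → 2
```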

This demand for scalability, combined with the need to rely on HPC resources to evaluate such larger simulations, makes the title of this deliverable, Scalable Sim-Diasca for the Blue Gene, quite self-explanatory.

2.2 City Example

2.2.1 Overview of the simulation case

The City Example simulation case has been designed to provide an open, sharable, tractable yet representative use case of Sim-Diasca for RELEASE's benchmarking purposes. Sim-Diasca is indeed a simulation engine, not a simulator; hence we need to define a simulation on top of it to create a benchmark.

The City example has been designed so that it is potentially arbitrarily scalable, both in terms of duration and size: there is no bound to the duration in virtual time during which the target city can be evaluated (of course the wall-clock time will in turn reflect this), nor to its size, as this is a telescopic simulation case based on a target system (the city) that is, according to various consistency constraints, generated procedurally.

Hence the City example can be used to benchmark arbitrarily long and large simulations, reflecting the typical issues that many real-world simulations exhibit. Some examples include sequential phases becoming acute problems, new bottlenecks appearing as the scale increases, each resource showing a criticality profile, etc.

2.2.2 Description of the simulated elements

This specific simulation attempts to represent a few traits of a city, i.e. the part that deals with waste management and the part that corresponds to the weather system above it.

The waste management system. Before being simulated, an artificial city must be procedurally generated. For that, a number of waste sources (residential or industrial), incinerators and landfills are defined, and a road network (made of roads and road junctions) is generated to interconnect them.

A pool of waste trucks is then created and dispatched on the road network: they will each strive, at simulation time, to transport wastes (multiple kinds of them are defined) so that the garbage produced by the various waste sources is collected and then transformed in incinerators, resulting in bottom ash that is then to be transported farther, into landfills.

In a properly balanced system, none of the waste storage facilities will be saturated in the process, incinerators will be appropriately fed, and waste will not accumulate in the chain.

An example of a road network corresponding to such a city is represented in Figure 1.

The waste system of these cities thus includes the following elements:

• waste sources, which are either residential (they are numerous, each producing small quantities of various waste types) or industrial (there are a few of them, mostly producing large quantities of mostly other waste types);

• incinerators, each being able to burn some of these types of waste (the duration of this process depending on several factors, including which tank is used, the kind of waste, and the burners that are available for that), but producing in turn non-incinerable waste (bottom ashes);

• landfills, which are able to store all kinds of wastes (incinerable or not), but are not able to transform them;

3 These synchronisations just operate so that a consensus on the next overall virtual timestamp is established.


Figure 1: A tiny instance of a generated road network

• waste trucks, which are able to transfer wastes from one point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), limited storage and possibilities of mixing wastes, and limited knowledge of their surroundings4;

• a road network, which allows vehicles (currently only waste trucks) to reach points of interest; this is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of vehicles on them, as shown in Figure 2). This network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS, currently not used in the course of the simulation, as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3.

This waste management system is not so trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep busy only a limited number of cores simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported5, and this showed indeed a level that,

4 These disaggregated, individual-based simulations rely only upon decentralised, partial information: for example, no agent (except, before the simulation starts, the mini-GIS) has a total knowledge of the road network (which, during the simulation, does not exist as such, for scalability reasons: it is merely an implicit graph).

5 Reporting the diasca count has had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead to having the list of their outgoing roads be permuted in some cases, which could lead in turn, far later in the simulations, to waste trucks making different


Figure 2: Vehicle speed based on the load of a road

Figure 3: Main classes and models of interest for the waste management system


Figure 4: Phases of a few weather cells, recreating Lorenz's strange attractor

once converted into a lower actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot; and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system. To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically thanks to a fourth-order Runge-Kutta method. It is additionally unsettled by its neighbours, as adjacent cells influence each other.
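The numerical scheme used by each cell can be sketched as follows: one RK4 step of the Lorenz system. This is a minimal illustration, not Sim-Diasca code; the Lorenz parameters shown are the classical ones, as the deliverable does not state which values the weather cells use.

```python
# Sketch of one weather cell's spontaneous behaviour: a fourth-order
# Runge-Kutta (RK4) step of the Lorenz system.  Parameter values are
# the classical ones and are an assumption here.

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """Advance `state` by one time-step `dt` with the RK4 method."""
    k1 = f(state)
    k2 = f(tuple(s + dt / 2 * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + dt / 2 * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# Cells start from different initial conditions but obey the same equations.
cell = (1.0, 1.0, 1.0)
for _ in range(1000):          # 1000 steps of dt = 0.01
    cell = rk4_step(lorenz, cell, 0.01)
print(cell)
```

In the actual engine a cell additionally exchanges its state with up to four neighbouring cells at each scheduling step, which is what produces the trajectories of Figure 4.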

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5 Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references onto the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control the processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and that can moreover be finely and easily tuned in terms of respective resource consumption.

Should we have to estimate the actual resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will start by being most probably CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then become increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second being processing), but will progressively lead to scattering the interacting instances more and more across the hosts [6] - thus increasingly replacing local communications by networked ones, and slowing down the

[6] Even with a smart load balancer the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regard to the number of hosts,

and a single sweet spot to exist.
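The 1/N locality estimate from the footnote can be checked with a tiny Monte Carlo sketch. Uniform random placement of the two endpoints of an interaction is an assumption here, corresponding to the "by default" (no smart load balancer) case.

```python
# Empirical check: if the two endpoints of an interaction are placed
# uniformly at random on N hosts, the fraction of host-local
# interactions should be close to 1/N.
import random

def local_fraction(n_hosts, trials=100_000, seed=42):
    rng = random.Random(seed)
    local = sum(rng.randrange(n_hosts) == rng.randrange(n_hosts)
                for _ in range(trials))
    return local / trials

for n in (2, 10, 100):
    print(n, local_fraction(n))   # close to 1/2, 1/10, 1/100
```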

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration was increasing very quickly as the scale grew - notably because of the embedded mini-GIS [7], which was operating sequentially, and whose load was growing exponentially with the number of spatialised instances to manage [8].

Efforts were made in order to remove that GIS bottleneck and have these initialisations be more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how the initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.
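A common way to cope with cyclic references when loading such a file is a two-pass scheme: first create every instance under its identifier, then resolve the references between them. The sketch below illustrates the idea only; the actual Sim-Diasca file format and classes are not shown in the deliverable, so all names here are hypothetical.

```python
# Hypothetical sketch of a two-pass loader that tolerates cyclic
# references: pass 1 instantiates everything, pass 2 wires references.

class Instance:
    def __init__(self, class_name):
        self.class_name = class_name
        self.links = []          # resolved references to other instances

def load(entries):
    """`entries` maps an id to (class_name, [referenced ids])."""
    # Pass 1: create every instance, so forward/cyclic ids can resolve.
    instances = {i: Instance(cls) for i, (cls, _) in entries.items()}
    # Pass 2: resolve references, now that all targets exist.
    for i, (_, refs) in entries.items():
        instances[i].links = [instances[r] for r in refs]
    return instances

# Two road junctions referencing each other (a cycle):
city = load({
    "j1": ("RoadJunction", ["j2"]),
    "j2": ("RoadJunction", ["j1"]),
})
assert city["j1"].links[0] is city["j2"]
assert city["j2"].links[0] is city["j1"]
```

Since pass 1 has no ordering constraints, it can be performed largely in parallel, which matches the parallel-loading requirement above.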

This newer scheme allowed the actual simulations to bypass the heavy, sequential GIS computations, since their precomputed result could be read directly from a pre-established file. If indeed the pre-simulation phases were shortened, the creation of the initial instances itself remained a demanding operation, even if it was largely made parallel.

The last changes made dealt with the integration of third-party tools with Sim-Diasca, like BenchErl and Percept2.

Distributed applications like Sim-Diasca have of course their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment by itself instead; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

If an ad hoc solution for the BenchErl integration could finally be devised, not only did deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which were the elected nodes, and notify it when each simulation phase began or finished (e.g. monitoring the

[7] GIS stands for Geographic Information System.

[8] The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances, otherwise the shorter roads would lead to traffic durations that would be brief to the point of inducing, when being quantised over the simulation time-step, a relative error above the default threshold allowed by the engine; Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading might not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables control of locality and reduces connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model nodes are grouped into a number of s_groups; nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration: there is no global name space, but every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project, and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found here: https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001-atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node, and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see https://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that it appears that the node names do not correspond exactly to the physical structure of the cluster: see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers (x1, ..., xn) ∈ [0, X]. The generator functions are applied


Figure 6 SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE

project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.
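The fixed-point computation described above can be sketched sequentially as follows. This is an illustration only: the toy generators are not those of the benchmark, and the real Orbit distributes the `seen` set over a DHT by hashing each number to a worker.

```python
# Sequential sketch of the Orbit kernel: apply every generator to every
# newly discovered number (kept inside [0, space) by reduction modulo
# `space`) until no new number appears.  Generators are illustrative.

def orbit(generators, x0, space):
    seen = {x0}                          # in the benchmark: a distributed hash table
    frontier = [x0]
    while frontier:                      # fixed point: stop when nothing new
        new = []
        for x in frontier:
            for g in generators:
                y = g(x) % space
                if y not in seen:
                    seen.add(y)
                    new.append(y)
        frontier = new
    return seen

gens = [lambda x: x + 1, lambda x: 2 * x]    # toy generators
print(len(orbit(gens, 0, 1000)))             # -> 1000: everything is reachable
```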

In this section we introduce how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and qos=release is the RELEASE project quota that enables requesting up to 256 Athos hosts.

To run the experiments we need to define the parameters in the run-slurm script (Figure 7):


FROMNUMNODES is the minimum number of nodes, on which we run the experiment in the first run.

STEPNODES is the step that we use to increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7 Parameters in run-slurm

Figure 8 Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to the following: $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; we then run the experiment using the timetest script, and stop the VMs. We also tried to run the experiments without stopping the VMs for every run, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though it takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench:dist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other, and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find in which part of the hash table this number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds back. Therefore, when the master process has collected the whole credit, it can detect that the computation has finished.
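The credit invariant behind this scheme can be illustrated with a minimal single-threaded sketch: the total credit is 1, every task in flight carries an exact share of it, and when the master has recovered the full credit no task can still be pending. The spawning pattern below is illustrative, not the benchmark's actual workload; exact fractions stand in for the credit representation.

```python
# Minimal sketch of credit-based termination detection: credit is
# conserved, split on spawn, and returned when a task becomes passive.
from fractions import Fraction

def run(tasks_per_task, initial_tasks=3, rounds=4):
    credit_back = Fraction(0)
    # each pending entry: (credit share, remaining depth of spawned work)
    pending = [(Fraction(1, initial_tasks), rounds)] * initial_tasks
    while pending:
        share, depth = pending.pop()
        if depth == 0:                     # passive: return the credit held
            credit_back += share
        else:                              # spawn children, splitting the credit
            n = tasks_per_task
            pending += [(share / n, depth - 1)] * n
    return credit_back                     # == 1 exactly when all work is done

print(run(2))   # -> 1 : full credit recovered, computation has terminated
```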

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/


Figure 9 D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code.

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2×10^6, 3×10^6, 4×10^6, 5×10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node, with an Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, and repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into a number of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10 Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in the worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections, and a sub-master node has (M − 1 + (N − 1)/M) connections.
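These per-node connection counts are easy to compare numerically. The sketch below uses integer division for the number of sub-master peers, which is an approximation of the (N − 1)/M term when M does not divide N − 1 exactly.

```python
# Per-node TCP connection counts implied by the two topologies:
# fully connected D-Orbit vs s_group-partitioned SD-Orbit.

def d_orbit_worker(n):
    return n - 1                       # fully connected: everyone else

def sd_orbit_worker(m):
    return m - 1                       # only the own s_group

def sd_orbit_submaster(n, m):
    # own s_group plus the other sub-masters in the master s_group
    return (m - 1) + (n - 1) // m

n, m = 121, 11
print(d_orbit_worker(n), sd_orbit_worker(m), sd_orbit_submaster(n, m))
# -> 120 10 20 : 120 connections per worker drop to 10,
#                while a sub-master needs about 20
```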

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process, and gateway processes. A sub-master process is responsible for the initiation and termination of the worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each sub-master s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that in this particular configuration the number of gateway processes does not have a significant impact on the SD-Orbit performance, so we have chosen 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No | Name | Location | Hosts | Cores per host | Total cores | Max cores | Wait time | Processor | RAM | Distributed Erlang Port
---|------|----------|-------|----------------|-------------|-----------|-----------|-----------|-----|------------------------
1 | GPG | GLA | 20 | 16 | 320 | 320 | 0 | Xeon E5-2640 v2, 2GHz | - | Yes
2 | TinTin | Uppsala | 160 | 16 | 2560 | - | - | - | - | Yes
3 | Kalkyl | Uppsala | - | 8 | - | - | varies | - | - | Yes
4 | Athos | EDF | 776 | 24 | 18624 | 6144 | varies | Xeon E5-2697 v2, 2.7GHz | 64GB | Yes
5 | Zumbrota | EDF | 4096 | 16 | 65536 | - | 17hrs | Blue Gene/Q (PowerPC A2) | - | No

Table 1 Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and, unlike D-Orbit, its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and then human involvement is required to restart the hosts. The way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking to the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11 D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12 D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13 SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14 D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15 D-Orbit and SD-Orbit Performance on Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source, and can be found here: https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started, the modifications to P serving to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO), which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
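The generation loop just described (ants build pheromone-guided schedules; the best one reinforces P while everything evaporates) can be sketched as follows. This is an illustrative single-threaded sketch, not the Erlang implementation: the toy cost function and the evaporation/deposit constants are placeholder assumptions, and a real SMTWTP cost would use job weights, lengths, and due dates.

```python
# Illustrative sketch of one ACO generation for an SMTWTP-style problem.
import random

def build_schedule(P, rng):
    """One ant: fill positions 0..n-1, choosing jobs with probability
    proportional to the pheromone entry P[job][position]."""
    n = len(P)
    jobs = list(range(n))
    schedule = []
    for pos in range(n):
        weights = [P[j][pos] for j in jobs]
        j = rng.choices(jobs, weights=weights)[0]   # pheromone-guided choice
        jobs.remove(j)
        schedule.append(j)
    return schedule

def generation(P, cost, n_ants, rng, rho=0.1, deposit=1.0):
    ants = [build_schedule(P, rng) for _ in range(n_ants)]
    best = min(ants, key=cost)                      # lowest-cost schedule wins
    for i in range(len(P)):                         # evaporation everywhere
        for j in range(len(P)):
            P[i][j] *= (1 - rho)
    for pos, job in enumerate(best):                # reinforce the best solution
        P[job][pos] += deposit
    return best

rng = random.Random(1)
n = 5
P = [[1.0] * n for _ in range(n)]                   # uniform initial pheromone
cost = lambda s: sum(pos * job for pos, job in enumerate(s))  # toy cost
for _ in range(30):
    best = generation(P, cost, n_ants=8, rng=rng)
print(best)
```

In the Erlang version, `build_schedule` corresponds to an ant process reading P (an ETS table), and `generation` to the master process, the only writer of P.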

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing, because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16 Two-Level Distributed ACO (a master process coordinating colony nodes Node 1 ... Node N_C, each hosting ant processes 1 ... N_A)

their best solutions; the globally-best solution is then selected, and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with N_C nodes. The master process spawns N_C colony processes on the available nodes; in the next step, each colony process spawns N_A ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are I_M communications between the master process and a colony process, and I_A bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N, and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L + 1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level prior to the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the numbers of processes, nodes, and levels. If the number of processes on each node is P, and the number of all available nodes is N, then the number of levels X is the maximum X in the following:


Figure 17 Node Placement in Multi-Level Distributed ACO (a master process at level 0, sub-master nodes at the intermediate levels, and only colony nodes at level N)


Figure 18 Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 nodes out of 150 can be used.
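This relation is easy to check numerically. The sketch below encodes our reading of the inequality (sub-master levels 0 to X−2, plus the colony level X); the helper names are illustrative, not project code.

```python
# Node budget for an ML-ACO tree: sub-master nodes on levels 0..X-2
# (P^0 + P^1 + ... + P^(X-2)) plus P^X colony nodes on level X.

def nodes_used(p, x):
    """1 + P + ... + P^(x-2) + P^x for a tree of x levels (x >= 2)."""
    return sum(p ** i for i in range(x - 1)) + p ** x

def max_levels(p, n):
    """Largest X such that nodes_used(p, X) <= n."""
    x = 2
    while nodes_used(p, x + 1) <= n:
        x += 1
    return x

print(max_levels(5, 150), nodes_used(5, 3))   # -> 3 131
```

This reproduces the worked example: with P = 5 and N = 150 the tree has 3 levels, using 131 of the 150 nodes.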

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), the solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) the overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

ICT-287510 (RELEASE) 23rd December 2015 29

Figure 19: Mean error (%) against number of colonies.

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
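The quality metric plotted in Figure 19 is straightforward to compute. A minimal sketch (in Python for illustration; the cost values in the usage note are made up, not ORLIB data, and the sketch assumes nonzero optimal costs):

```python
def mean_error_percent(found_costs, optimal_costs):
    """Mean percentage excess of the costs of the solutions found over
    the known optimal costs, averaged over all benchmark instances.
    Assumes every optimal cost is nonzero."""
    errors = [100.0 * (found - opt) / opt
              for found, opt in zip(found_costs, optimal_costs)]
    return sum(errors) / len(errors)

# e.g. two instances, each solved 5% worse than optimal:
# mean_error_percent([105, 210], [100, 200]) gives 5.0
```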

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Mean execution time (s) against number of colonies.

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
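Replacing the random number generator by a cyclic sequence can be sketched like this (Python for illustration; the real code is Erlang, and the particular cycle values here are hypothetical):

```python
from itertools import cycle

def make_cyclic_rng(sequence):
    """Return a zero-argument 'random' function that simply cycles
    through a fixed sequence, making repeated runs deterministic."""
    it = cycle(sequence)
    return lambda: next(it)

rand = make_cyclic_rng([0.1, 0.5, 0.9])
# successive calls to rand() yield 0.1, 0.5, 0.9, 0.1, ...
```

Every run then draws exactly the same sequence of "random" values, so any remaining variation in execution time is attributable to the system rather than to the search.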

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems; TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 22: OTP 17.4 execution times, Athos cluster.


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster.

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version, and as with the results for the Orbit benchmark (see §3.1.4) we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.
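The connection-count argument can be made concrete: with Erlang's default distribution, n fully-connected nodes maintain n(n-1)/2 connections, whereas partitioning the nodes into s_groups needs far fewer. The sketch below (Python, for illustration; the grouping model with one full mesh of gateways is a simplified assumption, not the exact SR-ACO topology) compares the two:

```python
def full_mesh_connections(n):
    """Default distributed Erlang: every node connects to every other."""
    return n * (n - 1) // 2

def sgroup_connections(n, g):
    """Rough connection count when n nodes are partitioned into s_groups
    of size g, with one gateway per group and the gateways themselves
    forming a full mesh. (Illustrative model only.)"""
    groups = -(-n // g)                      # ceiling division
    intra = groups * (g * (g - 1) // 2)      # connections inside each group
    inter = groups * (groups - 1) // 2       # gateway-to-gateway mesh
    return intra + inter

# e.g. for 256 nodes: a full mesh needs 32640 connections, whereas
# 16 groups of 16 need 16*120 + 120 = 2040 under this model.
```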

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)).

Figure 25: ML-ACO execution times, Athos cluster.


Figure 26: GR-ACO execution times, Athos cluster.

Figure 27: R15B execution times, messages ×500.


Figure 28: OTP 17.4 execution times, messages ×500.

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500.


Figure 30: R15B execution times (2), Athos cluster.

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, which means that at certain points a newly-included machine would lie in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster.

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster.


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
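The degree of fragmentation can be quantified directly from the allocation strings. A sketch (Python, assuming only the simple prefix[a-b,c,...] hostlist form shown above, not SLURM's full syntax):

```python
import re

def expand_hostlist(alloc):
    """Expand a SLURM allocation like 'atcn[055-072,109-144]' into its
    node numbers, and count how many contiguous fragments it contains."""
    inner = re.search(r'\[(.*)\]', alloc).group(1)
    nodes = []
    for part in inner.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            nodes.extend(range(int(lo), int(hi) + 1))
        else:
            nodes.append(int(part))
    # A new fragment starts wherever consecutive node numbers are not adjacent.
    fragments = 1 + sum(1 for a, b in zip(nodes, nodes[1:]) if b != a + 1)
    return nodes, fragments

# expand_hostlist("atcn[055-072,109-144]") gives 54 nodes in 2 fragments.
```

Applied to the two allocations above, this kind of count makes the contrast plain: the busy-period allocation is split into many more fragments than the quiet-period one.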

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.
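Packet counts of the kind plotted in Figure 33 can be read from standard Linux counters. A sketch (Python) that parses the /proc/net/dev format, shown here on a captured sample rather than the live file; the interface name and counter values in the sample are invented:

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: (rx_packets, tx_packets)}.
    Each data line has eight receive columns (bytes packets errs drop fifo
    frame compressed multicast) followed by eight transmit columns."""
    counters = {}
    for line in text.splitlines()[2:]:          # skip the two header lines
        iface, data = line.split(':', 1)
        fields = data.split()
        counters[iface.strip()] = (int(fields[1]), int(fields[9]))
    return counters

sample = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000000   15000    0    0    0     0          0         0  2000000   18000    0    0    0     0       0          0
"""
# parse_net_dev(sample)["eth0"] yields the (rx, tx) packet counts.
```

Sampling these counters before and after a run, and differencing, gives per-node packet totals of the kind aggregated in Figure 33.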

3.4 Summary

The results show that for both the Orbit and ACO benchmarks the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO: (a) Number of Sent Packets; (b) Number of Received Packets.


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver and are hence more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
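The speedup and efficiency figures follow directly from the measured runtimes. A small sketch (Python; the runtimes are read roughly off Figure 34, so the exact values are illustrative):

```python
def speedup_and_efficiency(t_base, t_n, n_units):
    """Relative speedup and parallel efficiency with respect to the base
    configuration (here: a single 16-core node counts as one unit)."""
    s = t_base / t_n
    return s, s / n_units

# Illustrative runtimes (minutes), roughly as in Figure 34:
# 1 node ~1000 min, 16 nodes ~290 min.
s, e = speedup_and_efficiency(1000, 290, 16)
# s is about 3.45, matching the maximum relative speedup reported above;
# e is about 0.22, confirming poor utilisation of the added nodes.
```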

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: that is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 logical available cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was to move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief': (a) Execution time; (b) Speedup.


Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief': (a) Execution time; (b) Speedup.


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates: by default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks: these allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy the computing nodes first and the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks WombatOAM to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes), and Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask WombatOAM to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimise the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool, Megaload, uses WombatOAM for deployment and has its own web dashboard which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries that make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,          % Node family of the computing nodes
>     soda_benchmarking_test). % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

The improved understanding of these applications and of the scalability issues they experience prepares the way for removing the next bottlenecks to be encountered, and promotes some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
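A retrying wrapper in the spirit of this workaround might look as follows. This is a hedged sketch, not the actual port's code: the helper name spin_read and the exact retry condition are our own, and on CNK the descriptor would be non-blocking so that read() returns EAGAIN instead of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch of the CNK workaround: retry read() in a spin loop
 * instead of letting the call block inside the kernel. */
ssize_t spin_read(int fd, void *buf, size_t count) {
    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0)
            return n; /* data was read (or end of file) */
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1; /* a real error: give up */
        /* otherwise spin and retry rather than blocking */
    }
}
```

An analogous wrapper would be used for write(), and for the socket pairs that replace pipe().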

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, instead, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, connect only to nodes to which we explicitly send messages, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called on all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.
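The name construction can be sketched in C as follows; this is a hypothetical illustration, and the placement of the "@" separator is our assumption based on Erlang's usual name@host node-name format, not taken from the mpihelper source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the basename ++ MPI index ++ hostname construction
 * performed by mpihelper:startup/1. The '@' separator is assumed from
 * Erlang's standard name@host node-name syntax. */
static void mpi_node_name(char *buf, size_t len,
                          const char *base, int mpi_index, const char *host) {
    snprintf(buf, len, "%s%d@%s", base, mpi_index, host);
}
```

With the default base name, MPI rank 3 on a host named cn017 would thus get a node name of the shape mpinode3@cn017 (the host name here is purely illustrative).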

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first listen call only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes at that point. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
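The overall shape of such a driver — a table of callbacks handed to the runtime, with output() dispatching on a leading command byte — can be sketched as follows. This is a much-simplified illustration: the struct layout is not the real erl_driver entry, and the command byte values are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative command bytes; the real driver's encoding is not shown here. */
enum { CMD_LISTEN = 'L', CMD_ACCEPT = 'A', CMD_CONNECT = 'C',
       CMD_SEND = 'S', CMD_RECEIVE = 'R' };

/* Simplified callback table in the spirit of an Erlang driver entry. */
typedef struct {
    void (*init)(void);                                    /* set up MPI       */
    void (*start)(int port);                               /* per-port state   */
    void (*output)(int port, const char *buf, size_t len); /* command dispatch */
    void (*ready_input)(int port);                         /* data arrived     */
    void (*ready_output)(int port);                        /* flush buffers    */
} driver_entry;

static char last_cmd; /* records the last dispatched command, for the sketch */

/* output(): parse the leading command byte and relay the rest (buf+1 .. len-1). */
static void mpi_output(int port, const char *buf, size_t len) {
    (void)port;
    if (len == 0)
        return;
    switch (buf[0]) {
    case CMD_LISTEN:  /* enter listening mode; broadcast node names once   */
    case CMD_ACCEPT:  /* spawn a thread for the next incoming connection   */
    case CMD_CONNECT: /* spawn a thread to reach a remote acceptor port    */
    case CMD_SEND:    /* transmit the remaining payload bytes              */
    case CMD_RECEIVE: /* receive data into the port's buffer               */
        last_cmd = buf[0];
        break;
    }
}

static const driver_entry mpi_driver = {
    .init = NULL, .start = NULL, .output = mpi_output,
    .ready_input = NULL, .ready_output = NULL,
};
```

The control() callback would follow the same command-byte pattern in a second dispatch table for the control-mode commands listed above.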

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions (execution time in seconds against number of ants: 1, 10, 20, 30, ..., 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE))

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000 ants.

• Large: 1, 500, 1000, 1500, ..., 100000 ants.

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions (execution time in seconds against number of ants: 1, 500, 1000, 1500, ..., 100000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE))

Figure 48: EDF Xeon machines, small executions, erts +Muacul 0 flag set (execution time in seconds against number of ants: 1, 10, 20, 30, ..., 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE))


Figure 49: EDF Xeon machines, large executions, erts +Muacul 0 flag set (execution time in seconds against number of ants: 1, 500, 1000, 1500, ..., 100000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE))

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul 0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul 0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul 0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


Figure 50: Glasgow Xeon machines, small executions (execution time in seconds against number of ants: 1, 10, 20, 30, ..., 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE))

Figure 51: Glasgow Xeon machines, large executions (execution time in seconds against number of ants: 1, 500, 1000, 1500, ..., 100000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE))


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions (execution time in seconds against number of ants: 1, 10, 20, 30, ..., 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE))


Figure 54: Heriot-Watt AMD machine, large executions (execution time in seconds against number of ants: 1, 500, 1000, 1500, ..., 100000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE))

Change Log

Version 0.1 (31/01/2015): First version, submitted to internal reviewers.

Version 0.2 (23/03/2015): Revised version based on comments from all internal reviewers, submitted to the Commission Services.

Version 1.0 (27/03/2015): Final version, submitted to the Commission Services.

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

ICT-287510 (RELEASE) 23rd December 2015 67

[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



A Porting Erlang/OTP to the Blue Gene/Q
  A.1 Basing Erlang/OTP's Distribution Mechanism on MPI
  A.2 MPI Driver Internals
  A.3 Current Status of the Blue Gene/Q Port

B Single-machine ACO performance on various architectures and Erlang/OTP releases
  B.1 Experimental parameters
  B.2 Discussion of results
    B.2.1 EDF Xeon machines
    B.2.2 Glasgow Xeon machines
    B.2.3 AMD machines
  B.3 Discussion


1 Executive Summary

The stated objectives of this deliverable are to "port Sim-Diasca to the Blue Gene architecture, adding locality control as required". In the event we have interpreted this more generally, studying two benchmarks in addition to Sim-Diasca and measuring them on five parallel architectures (Section 3.1.5), as outlined below.

The deliverable aims to study the scalability of Erlang programs in order to be able to process larger problem sizes while making good use of available computing resources. More precisely, the overall aim here is to study:

• How Erlang programs currently scale when using large computing infrastructures such as high-performance clusters. We intended to investigate the performance on the Blue Gene/Q supercomputer but, as outlined in Appendix A, the corresponding port of the Erlang runtime was only partly functional, due to issues at the level of the networking back-end. Instead we have used 5 conventional clusters.

• The extent to which the scalability of Erlang programs can be improved by adopting architectural changes and making various software choices. We compare the performance of the Erlang/OTP release that existed at the start of the project (R15B) with the version containing the RELEASE scalability improvements (17.4). We also measure the impact of using the SD Erlang version developed in the project.

To achieve these goals, the scalability of one main case study will be studied, namely the discrete-time simulation engine named Sim-Diasca (released as free software by EDF since 2010), whose purpose is to execute large simulations of complex systems (Section 2.1). For RELEASE, a full benchmarking simulation case named City-example has been devised and implemented by EDF (Section 2.2). It transpires that the Sim-Diasca City instance scales up to at least 16 hosts (256 cores) on two clusters, but exhibits poor efficiency on both (Section 4.1). We investigate the scalability issues using both standard profiling tools (Section 4.1) and the new RELEASE tools (Sections 4.2 and 4.3).

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 hosts. These are not the network connectivity issues that emerge at scales of around 60 hosts in our Riak study [GCTM13] and are established in Erlang folklore. To investigate the issues at these larger scales, we re-use the Orbit benchmark (D3.1) and develop a new Ant Colony Optimisation (ACO) benchmark. We find that SD Erlang improves the performance of both ACO and Orbit beyond 60 hosts (Sections 3.1.4 and 3.3.4). Moreover, applications running on Erlang/OTP R15B, 17.4 (official) and 17.4 (RELEASE) all exhibit similar scaling, e.g. similar speedup and runtime curves. However, the 17.4 versions have slightly smaller runtimes than R15B on AMD architectures (Appendix B), while the converse holds on Intel Xeon machines (Section 3.3).

We demonstrate the deployment and monitoring of Sim-Diasca using our new WombatOAM tool (Section 5). Although the Sim-Diasca City instance has not reached the 60-host scale where SD Erlang techniques can help, we present a preliminary design for applying SD Erlang to it (Section 5.2).

Partner Contributions: EDF created the City-example simulation case based on Sim-Diasca (Section 2), provided access to the Blue Gene/Q and the Athos cluster, and, as lead participant, coordinated the case study. Glasgow designed and measured the ACO and Orbit benchmarks (Section 3), investigated the scalable performance of the City Sim-Diasca instance using conventional tools (Section 4.1), and proposed a design for incorporating SD Erlang into the design of Sim-Diasca (Section 5.2). Kent provided the Percept2 tool, applied to Sim-Diasca by ICCS in Section 4.3. ICCS and Uppsala also applied BenchErl to Sim-Diasca (Section 4.2). Uppsala worked on the port of Erlang to the Blue Gene/Q, which is outlined in Appendix A. ESL developed a version of Sim-Diasca whose deployment relied on


WombatOAM, and demonstrated its deployment and monitoring with WombatOAM (Section 5). EAB provided the Erlang/OTP releases on which we base our performance measurements.


2 The main case study

2.1 Sim-Diasca Overview

Sim-Diasca stands for Simulation of Discrete Systems of All Scales. Sim-Diasca (http://www.sim-diasca.com) is a discrete-time simulation engine designed to be applied to large-scale complex systems. This engine is developed by EDF R&D, and it has been released since 2010 as free software under the GNU LGPL licence.

Simulators tend to be sizable, if not massive; typical examples are large-scale information systems, smart metering infrastructures involving millions of interacting devices, full ecosystems, the operating components of utilities (energy, waste, etc.) at the scale of entire cities, etc. As long as a target system can be logically subdivided into (potentially very numerous) parts interacting over discrete time, chances are that it can be modelled according to Sim-Diasca's conventions and then simulated by this engine.

The overall objective of the engine is to evaluate correctly the models involved in a simulation and, for that, to preserve key properties, like causality, total reproducibility and some kind of "ergodicity" (a fair exploration of the possible outcomes of the simulation1).

Preserving these properties would not be a real problem if the size of the simulated systems remained within reasonable bounds. As this is, by design, hardly the case for most complex systems (as extrapolating their behaviour based on scale models is hazardous at best), the engine had to be designed so that it can deal with up to millions of tightly interacting model instances. Such simulations cannot be evaluated unless major efforts are spent so that they are as much as possible parallel (they can make use of all the cores of all the processors of a computer) and distributed (a set of networked computers can be used in order to collectively run that single simulation). This is often needed to keep the simulation durations (in wall-clock time) below a threshold (that could not be met if using at most one core of one processor, like many engines do), and to get access to enough memory (RAM) so that these simulations can exist at all.

Once these concurrent (i.e. parallel and distributed) operations can be properly expressed and organised, they still have to be implemented and effectively run on actual, adequate processing resources, typically HPC2 clusters or supercomputers such as EDF's Blue Gene/Q.

So the central difficulty is to preserve the aforementioned properties despite massive concurrency and very significant problem sizes: scalability is surely at the heart of the Sim-Diasca use case. This is all the more a challenge as these discrete-time simulation engines are far from being embarrassingly parallel problems: one should not expect to see perfect speed-ups here, as many interleaved operations have to be finely synchronised by the engine so that all constraints are met; opening up any underlying potential concurrency thus comes at a cost.

More precisely, based on the requested simulation frequency, Sim-Diasca splits the simulated time into a series of time steps, automatically skipping the ones that can be jumped over, and reordering the inter-model messages so that properties like reproducibility are met. Causality resolution requires that time steps be further divided into as many logical moments (named diascas) as needed. During a given

1 Even in the absence of stochastic models, concurrent events allow for multiple possible "licit" trajectories of the target system.

2 Meaning High Performance Computing.


diasca, all model instances that have to be scheduled will then be evaluated fully concurrently, but this massive parallelism can only happen between two (lightweight) distributed synchronisations3.

This demand for scalability, combined with the need to rely on HPC resources to evaluate such larger simulations, makes the title of this deliverable, Scalable Sim-Diasca for the Blue Gene, quite self-explanatory.

2.2 City Example

2.2.1 Overview of the simulation case

The City Example simulation case has been designed to provide an open, sharable, tractable yet representative use case of Sim-Diasca for RELEASE's benchmarking purposes. Sim-Diasca is indeed a simulation engine, not a simulator; hence we need to define a simulation on top of it to create a benchmark.

The City example has been designed so that it is potentially arbitrarily scalable, both in terms of duration and size: there are no bounds to the duration in virtual time during which the target city can be evaluated (of course the wall-clock time will in turn reflect this), nor to its size, as this is a telescopic simulation case based on a target system (the city) that is, according to various consistency constraints, generated procedurally.

Hence the City example can be used to benchmark arbitrarily long and large simulations, reflecting the typical issues that many real-world simulations exhibit: sequential phases becoming acute problems, new bottlenecks appearing as the scale increases, each resource showing a criticality profile, etc.

2.2.2 Description of the simulated elements

This specific simulation attempts to represent a few traits of a city, namely the part that deals with waste management and the weather system above it.

The waste management system. Before being simulated, an artificial city must be procedurally generated. For that, a number of waste sources (residential or industrial), incinerators and landfills are defined, and a road network (made of roads and road junctions) is generated to interconnect them.

A pool of waste trucks is then created and dispatched on the road network: they will each strive, at simulation-time, to transport waste (multiple kinds of which are defined) so that the garbage produced by the various waste sources is collected and then transformed in incinerators, resulting in bottom ash that is then to be transported farther, to landfills.

In a properly balanced system, none of the waste storage facilities will be saturated in the process, incinerators will be appropriately fed, and waste will not accumulate in the chain.

An example of a road network corresponding to such a city is represented in Figure 1.

The waste system of these cities thus includes the following elements:

- waste sources, which are either residential (they are numerous, each producing small quantities of various waste types) or industrial (there are a few of them, mostly producing large quantities of mostly other waste types);

- incinerators, each being able to burn some of these types of waste (the duration of this process depending on several factors, including which tank is used, the kind of waste, and the burners that are available for that), but producing in turn non-incinerable waste (bottom ashes);

- landfills, which are able to store all kinds of waste (incinerable or not), but are not able to transform them;

3These synchronisations just operate so that a consensus on the next overall virtual timestamp is established


Figure 1: A tiny instance of a generated road network

- waste trucks, which are able to transfer waste from one point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), limited storage and possibilities of mixing wastes, and limited knowledge of their surroundings4;

- a road network, which allows vehicles (currently only waste trucks) to reach points of interest; this is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of vehicles on them, as shown in Figure 2). This network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS, currently not used in the course of the simulation, as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3.

This waste management system is not trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep busy only a limited number of cores simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported5, and this showed indeed a level that,

4 These disaggregated, individual-based simulations rely only upon decentralised, partial information; for example, no agent - except, before the simulation starts, the mini-GIS - has a total knowledge of the road network (which, during the simulation, does not exist as such for scalability reasons - it is merely an implicit graph).

5 Reporting the diasca count has had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead to the list of their outgoing roads being permuted in some cases, which could lead in turn, far later in the simulations, to waste trucks making different


Figure 2: Vehicle speed based on the load of a road

Figure 3: Main classes and models of interest for the waste management system


Figure 4: Phases of a few weather cells recreating Lorenz's strange attractor

once converted into a lower actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot - and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system. To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Based on its state, each cell solves these differential equations numerically, thanks to a Runge-Kutta fourth-order method. It is additionally perturbed by its neighbours, as adjacent cells influence each other.
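A single integration step of this kind can be sketched as follows (an illustrative standalone module using the classic Lorenz parameters; the engine's actual solver, module names and state representation differ):

```erlang
-module(lorenz_rk4).
-export([step/2]).

%% Classic Lorenz parameters (assumed for illustration).
-define(SIGMA, 10.0).
-define(RHO, 28.0).
-define(BETA, 8.0 / 3.0).

%% Lorenz derivatives: dX/dt = sigma(Y-X), dY/dt = X(rho-Z)-Y, dZ/dt = XY - beta.Z.
f({X, Y, Z}) ->
    {?SIGMA * (Y - X),
     X * (?RHO - Z) - Y,
     X * Y - ?BETA * Z}.

%% One fourth-order Runge-Kutta step of size H from state S.
step(S, H) ->
    K1 = f(S),
    K2 = f(axpy(S, K1, H / 2)),
    K3 = f(axpy(S, K2, H / 2)),
    K4 = f(axpy(S, K3, H)),
    {X, Y, Z} = S,
    {A1, B1, C1} = K1, {A2, B2, C2} = K2,
    {A3, B3, C3} = K3, {A4, B4, C4} = K4,
    {X + H / 6 * (A1 + 2 * A2 + 2 * A3 + A4),
     Y + H / 6 * (B1 + 2 * B2 + 2 * B3 + B4),
     Z + H / 6 * (C1 + 2 * C2 + 2 * C3 + C4)}.

%% S + Scale * K, component-wise.
axpy({X, Y, Z}, {DX, DY, DZ}, Scale) ->
    {X + Scale * DX, Y + Scale * DY, Z + Scale * DZ}.
```

In the simulation each cell would iterate such steps over its own state, exchanging boundary perturbations with its neighbours via actor messages.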

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5: Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references onto the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control a processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), that can moreover be finely and easily tuned in terms of respective resource consumption.

Should we have to figure out the actual resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will start by being most probably CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then be increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second being processing), but will progressively lead to scattering the interacting instances more and more across the hosts6, thus increasingly replacing local communications by networked ones and slowing down the

6 Even with a smart load balancer, the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration increased very quickly as the scale grew - notably because of the embedded mini-GIS7, which was operating sequentially and whose load grew exponentially with the number of spatialised instances to manage8.

Efforts were made to remove that GIS bottleneck and make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.
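Purely as an illustration of what such a format can look like (the actual Sim-Diasca format is not reproduced here; all class and parameter names below are hypothetical), an initialisation stream could hold one creation term per instance, each naming a class and its construction parameters, with instances freely referring to each other - such forward and cyclic references are what the loading mechanism must resolve:

```erlang
%% Hypothetical initialisation entries (one Erlang term per line); the
%% real Sim-Diasca format differs. Note that "road_7" refers to "inc_1"
%% and vice versa: the loader must cope with such cyclic references.
{class_IndustrialWasteSource, ["src_1", {waste_capacity, 200.0}, {connected_to, ["road_7"]}]}.
{class_Incinerator, ["inc_1", {tank_count, 2}, {connected_to, ["road_7"]}]}.
{class_Road, ["road_7", {from, "src_1"}, {to, "inc_1"}, {length, 850.0}]}.
```

Because each term is self-describing, disjoint chunks of such a file can be handed to different loader processes and parsed in parallel, with references patched once all instances exist.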

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be read directly from a pre-established file. If indeed the pre-simulation phases were shortened, the creation of the initial instances itself remained a demanding operation, even if it was largely made parallel.

The last changes that were made dealt with the integration of third-party tools, like BenchErl and Percept2, with Sim-Diasca.

Distributed applications like Sim-Diasca have of course their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment by itself instead; then a simple script was written allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

Even if an ad hoc solution for the BenchErl integration could finally be devised, not only did deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which were the elected nodes, and notify it when each simulation phase began or finished (e.g. monitoring the

7 GIS stands for Geographic Information System.

8 The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances; otherwise the shorter roads would lead to traffic durations that would be brief to the point of inducing, when being quantised over the simulation time-step, a relative error above the default threshold allowed by the engine. Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading might not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings are discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables the programmer to control locality and reduce connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model, nodes are grouped into a number of s_groups: nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global name space; instead, every s_group has its own namespace, which is shared among the group members only.
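This model can be sketched with the s_group API (a sketch only: the node names are hypothetical, the calls require a running SDErl-17.4 cluster, and exact signatures should be checked against the SD Erlang documentation):

```erlang
%% Sketch, to be run on a live SD Erlang node; 'n1@host' etc. are
%% hypothetical node names.

%% Create an s_group of three nodes: its members become transitively
%% connected to each other, while connections to nodes outside the
%% group remain non-transitive.
{ok, group_a, _Nodes} =
    s_group:new_s_group(group_a, ['n1@host', 'n2@host', 'n3@host']),

%% Register a name in the group's own namespace (there is no global
%% namespace in this model):
yes = s_group:register_name(group_a, submaster, self()),

%% Only members of group_a can resolve the name:
Pid = s_group:whereis_name(group_a, submaster).
```

The point of the design is that connection count and name-registration traffic grow with the size of an s_group rather than with the size of the whole system.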

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project and can be found here: https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node, and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below), which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy.

The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that the node names do not appear to correspond exactly to the physical structure of the cluster: see [REL15, §4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the orbit for a given space [0,X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0,X]. This creates new numbers (x1, ..., xn) ∈ [0,X]. The generator functions are applied


Figure 6: SLURM allocation

on the new numbers, until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines long, and has good performance and extensibility.
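The fixpoint computation described above can be sketched sequentially in a few lines of Erlang (a toy version for illustration only; the benchmark itself partitions the table and the frontier across worker processes):

```erlang
-module(orbit_seq).
-export([orbit/3]).

%% Compute the orbit of X0 under the generator functions Gs within
%% [0, Space): repeatedly apply every generator to newly discovered
%% numbers until no new number appears, then return the sorted orbit.
orbit(Gs, X0, Space) ->
    Start = X0 rem Space,
    explore(Gs, [Start], sets:from_list([Start]), Space).

explore(_Gs, [], Seen, _Space) ->
    lists:sort(sets:to_list(Seen));
explore(Gs, [X | Frontier], Seen, Space) ->
    %% Apply each generator to X, folding results back into the space.
    Candidates = lists:usort([G(X) rem Space || G <- Gs]),
    Fresh = [Y || Y <- Candidates, not sets:is_element(Y, Seen)],
    Seen2 = lists:foldl(fun sets:add_element/2, Seen, Fresh),
    explore(Gs, Fresh ++ Frontier, Seen2, Space).
```

The distributed version replaces the single `Seen` set with the DHT described below, and the termination test with the credit/recovery algorithm.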

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case it is up to 60 nodes), and is mainly used to check whether the script works. Here, N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and qos=release is the RELEASE project quota that enables us to request up to 256 Athos hosts.

To run the experiments we need to define parameters in the run-slurm script (Figure 7).


FROMNUMNODES is the minimum number of nodes on which we run the experiment in the first run.

STEPNODES is the step that we use to increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
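The sweep defined by these parameters can be modelled as follows. This is an illustrative Python sketch, not the actual run-slurm shell script; the parameter names are taken from Figure 7.

```python
def node_counts(from_num_nodes, step_nodes, max_nodes):
    """Node counts for successive runs: start at FROMNUMNODES and
    increase by STEPNODES until the allocation size is reached."""
    counts = []
    n = from_num_nodes
    while n <= max_nodes:
        counts.append(n)
        n += step_nodes
    return counts

def schedule(from_num_nodes, step_nodes, max_nodes, num_repeat):
    """Every experiment is repeated NUMREPEAT times."""
    return [(n, r)
            for n in node_counts(from_num_nodes, step_nodes, max_nodes)
            for r in range(1, num_repeat + 1)]

# The example from the text: 10 allocated nodes, FROMNUMNODES=4,
# STEPNODES=3, NUMREPEAT=2 -> runs on 4, 7 and 10 nodes, twice each.
print(node_counts(4, 3, 10))        # [4, 7, 10]
print(len(schedule(4, 3, 10, 2)))   # 6
```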

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; we then run the experiment using the timetest script and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer than reusing the same VMs for all runs.

The module, function, and parameters that are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the benchdist4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table; a hash function is applied to a generated number to determine in which part of the hash table the number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to the active processes. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
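The credit scheme can be illustrated with a toy sequential model. This is a sketch of the idea only, not the Erlang implementation: real Orbit processes hold and return credit concurrently, whereas here a stack simulates the set of active tasks. Exact fractions show that the returned credit sums to exactly 1 when, and only when, all work is done.

```python
from fractions import Fraction

def run(branching, depth):
    """Toy model of credit/recovery termination detection [MC98].
    The computation starts with credit 1. A task that spawns a child
    hands over half of its remaining credit; a task that becomes
    passive returns whatever credit it still holds. Termination is
    detected once the returned credit sums to 1."""
    returned = Fraction(0)
    stack = [(Fraction(1), depth)]          # (credit held, remaining depth)
    while stack:
        credit, d = stack.pop()
        if d > 0:
            for _ in range(branching):      # spawn children, halving credit
                credit /= 2
                stack.append((credit, d - 1))
        returned += credit                  # task goes passive: credit back
    return returned

print(run(3, 4) == 1)   # True: all credit recovered, computation finished
```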

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code.

Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is benchg123451.

• We run experiments for the following initial Orbit space sizes: 2×10^6, 3×10^6, 4×10^6, and 5×10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size of 2M elements, varying the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, and repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into s groups. Here we have two types of s groups: master and worker (Figure 10). There is only one master s group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s groups. Each worker s group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s group and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s group, the communication is done via the sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s group plus the number of sub-master nodes in the master s group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.
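These connection counts can be checked with a small calculation. This is an illustrative sketch; the formulas are exact only when N partitions evenly into s groups of M nodes, and the example node counts are assumptions, not figures from the experiments.

```python
def d_orbit_connections(n):
    """Fully connected distributed Erlang: each of the N nodes
    connects to every other node."""
    return n - 1

def sd_orbit_worker_connections(m):
    """An SD-Orbit worker connects only to the other members of its
    s group of M nodes."""
    return m - 1

def sd_orbit_submaster_connections(n, m):
    """A sub-master connects to the M-1 other nodes of its s group
    plus the other sub-masters, roughly (N-1)/M of them."""
    return (m - 1) + (n - 1) / m

# Hypothetical example: 257 nodes in s groups of 16.
print(d_orbit_connections(257))                 # 256
print(sd_orbit_worker_connections(16))          # 15
print(sd_orbit_submaster_connections(257, 16))  # 31.0
```

The point of the comparison: per-worker connection counts drop from linear in the cluster size to linear in the (fixed) s group size.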

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s group, for collecting credit and data, and for forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we define the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each sub-master s group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on the corresponding number of nodes and cores.

No | Name     | Location | Hosts | Cores per host | Total cores | Max cores | Wait time | Processor                | RAM  | Distributed Erlang Port
1  | GPG      | GLA      | 20    | 16             | 320         | 320       | 0         | Xeon E5-2640 v2, 2GHz    | -    | Yes
2  | TinTin   | Uppsala  | 160   | 16             | 2560        | -         | -         | -                        | -    | Yes
3  | Kalkyl   | Uppsala  | -     | 8              | -           | -         | varies    | -                        | -    | Yes
4  | Athos    | EDF      | 776   | 24             | 18624       | 6144      | varies    | Xeon E5-2697 v2, 2.7GHz  | 64GB | Yes
5  | Zumbrota | EDF      | 4096  | 16             | 65536       | -         | 17hrs     | Blue Gene/Q (PowerPC A2) | -    | No

Table 1: Machines Available for Benchmarking in the RELEASE Project

In the experiments we use Erl-R15B and SDErl-17.4. For each experiment we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts to degrade; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of the Orbit, which varies from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and, unlike D-Orbit, its performance does not deteriorate.

When we increase the size of the Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human intervention is then required to restart the hosts. Because of the way SLURM works, a user is not immediately informed of the reason for such failures, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put approximately 157 Athos hosts out of action, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This holds for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on two further clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to Deliverable D3.4, Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.
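The scheme above can be sketched in a few lines of Python. This is an illustrative model only, not the project's Erlang code: the construction rule uses the pheromone entries alone (no heuristic term), and the evaporate-and-reinforce update is one simple variant of the rule described in the text.

```python
import random

def tardiness_cost(schedule, jobs):
    """Total weighted tardiness; jobs are (length, weight, due_date)."""
    t, cost = 0, 0
    for j in schedule:
        length, weight, due = jobs[j]
        t += length
        cost += weight * max(0, t - due)
    return cost

def construct(pheromone, rng):
    """One ant builds a schedule, picking the job for each position with
    probability proportional to the pheromone entry P[job][position]."""
    n = len(pheromone)
    free, schedule = list(range(n)), []
    for pos in range(n):
        j = rng.choices(free, weights=[pheromone[k][pos] for k in free])[0]
        free.remove(j)
        schedule.append(j)
    return schedule

def run_colony(jobs, n_ants=10, generations=20, rho=0.1, seed=1):
    rng = random.Random(seed)
    n = len(jobs)
    pheromone = [[1.0] * n for _ in range(n)]
    best, best_cost = None, float("inf")
    for _ in range(generations):
        ants = [construct(pheromone, rng) for _ in range(n_ants)]
        gen_best = min(ants, key=lambda s: tardiness_cost(s, jobs))
        if tardiness_cost(gen_best, jobs) < best_cost:
            best, best_cost = gen_best, tardiness_cost(gen_best, jobs)
        for row in pheromone:                 # evaporation
            for pos in range(n):
                row[pos] *= 1 - rho
        for pos, j in enumerate(best):        # reinforce the best solution
            pheromone[j][pos] += rho
    return best, best_cost

# Hypothetical 4-job instance: (length, weight, due_date) per job.
jobs = [(3, 2, 4), (1, 5, 2), (2, 1, 9), (4, 3, 6)]
best, cost = run_colony(jobs)
assert sorted(best) == [0, 1, 2, 3]   # a valid permutation of the jobs
```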

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows.

[Diagram: a master process connected to NC colony nodes (Node 1 ... Node NC), each colony running NA ant processes]

Figure 16: Two-Level Distributed ACO

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placement of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, objects and their corresponding captions have the same colour. As the arrows show, communication between the master process and the colonies is bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (the level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying 1 + P + P^2 + P^3 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150) and only 131 of the 150 nodes can be used.

[Diagram: the master process at level 0, sub-master nodes at intermediate levels, and colony nodes (one colony process each) at level N]

Figure 17: Node Placement in Multi-Level Distributed ACO

Figure 18: Process Placement in Multi-Level ACO
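The level computation described above can be evaluated mechanically. A sketch: this simply searches for the largest X satisfying the inequality, and also reports how many of the N nodes are actually used.

```python
def tree_levels(p, n):
    """Largest number of levels X such that the sub-master levels
    (1 + P + ... + P^(X-2)) plus the P^X colony nodes fit into N
    nodes, together with the number of nodes actually used."""
    def used(x):
        return sum(p ** i for i in range(x - 1)) + p ** x

    x = 1
    while used(x + 1) <= n:
        x += 1
    return x, used(x)

# The example from the text: P = 5, N = 150.
print(tree_levels(5, 150))   # (3, 131), since 1 + 5 + 5^3 = 131 <= 150
```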

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s group.

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.2.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase) solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs among the versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Plot: mean error (%) against the number of colonies, from 1 to 256]

Figure 19: Mean Error

3.2.4 Experimental Evaluation

A method commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken per solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
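The quality metric plotted in Figure 19 can be stated precisely: for each instance, the error is the percentage by which the cost of the found solution exceeds the known optimal cost, averaged over all instances. A sketch with a hypothetical helper (not code from the benchmark):

```python
def mean_error(found_costs, optimal_costs):
    """Mean percentage excess of found solution costs over known optima."""
    errors = [100.0 * (found - opt) / opt
              for found, opt in zip(found_costs, optimal_costs)]
    return sum(errors) / len(errors)

# Made-up example costs: three instances, all with optimal cost 100.
print(mean_error([110, 105, 100], [100, 100, 100]))   # 5.0
```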

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version on 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we removed non-determinacy by replacing the random number generator with a function that returns a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but it is typically only about 2-3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

[Plot: mean execution time (s) against the number of colonies, from 1 to 256]

Figure 20: Execution time

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times reported here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21-23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and that the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s groups to reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 21: R15B execution times, Athos cluster

[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 22: OTP 17.4 execution times, Athos cluster

[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO]

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster

For completeness, Figures 24-26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see Section 3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than the R15B VM does.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from Section 3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see Section 3.3.4 below.

The results for the various ACO versions are shown in Figures 27-29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate phenomena similar to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21-23 in Section 3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


[Plot: execution time (s) against the number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)]

Figure 24: TL-ACO execution times, Athos cluster

[Plot: execution time (s) against the number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)]

Figure 25: ML-ACO execution times, Athos cluster


[Plot: execution time (s) against the number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)]

Figure 26: GR-ACO execution times, Athos cluster

[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 27: R15B execution times, messages ×500


[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 28: OTP 17.4 execution times, messages ×500

[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500


[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon that has caused us some difficulty. The results are shown in Figures 30-32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in Section 3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of Section 3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]
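Allocations like these use SLURM's compressed hostlist syntax. A hypothetical helper to expand the bracketed ranges can be sketched as follows (in practice one would use `scontrol show hostnames`, which handles the full syntax):

```python
import re

def expand_hostlist(spec):
    """Expand a SLURM-style hostlist such as 'atcn[141,144,181-184]'
    into individual host names, preserving zero padding ('atcn055')."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", spec)
    prefix, ranges = m.group(1), m.group(2)
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)               # keep leading zeroes
            hosts.extend(f"{prefix}{i:0{width}d}"
                         for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + part)
    return hosts

hosts = expand_hostlist("atcn[055-057,141,144,181-184]")
print(len(hosts))   # 9
```

Counting the hosts in the two allocations above makes the difference in fragmentation concrete: the busy-time allocation is split across far more disjoint ranges for a similar number of machines.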

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 31: OTP 17.4 execution times (2), Athos cluster

[Plot: execution time (s) against the number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/sub-master nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP 17.4 than in Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• analysing Orbit benchmark performance when there is more than one Erlang VM per host;

• investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver and are thus more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), efficiency falls thereafter: the speedup is only 2.2 on 4 nodes (64 cores), and degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here; that is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that the default settings would select.
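For reference, the scheduler settings discussed above can be checked from a running VM; a minimal sketch, assuming the VM was started with flags such as erl +sbt tnnps +S 12:12:

```erlang
%% Sketch: inspecting, from the Erlang shell, the scheduler
%% configuration discussed above.
io:format("schedulers online: ~p~n",
          [erlang:system_info(schedulers_online)]),
%% Should report thread_no_node_processor_spread when +sbt tnnps is used.
io:format("bind type: ~p~n",
          [erlang:system_info(scheduler_bind_type)]),
io:format("bindings: ~p~n",
          [erlang:system_info(scheduler_bindings)]).
```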

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems when running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and in the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code; and

• most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

We therefore moved all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevented Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips and the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
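The structure of this plugin can be sketched as follows; the callback names are hypothetical (the actual Sim-Diasca plugin API differs), and the Percept2 entry points percept2:profile/2 and percept2:stop_profile/0 are assumed.

```erlang
%% Hypothetical sketch of the plugin described above: the callback
%% names are invented; only the overall structure reflects what was done.
-module(percept2_plugin_sketch).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Start Percept2 profiling on every computing node; each node writes
%% its trace to a node-local file.
on_simulation_start(ComputingNodes) ->
    [rpc:call(Node, percept2, profile, ["simdiasca.dat", [concurrency]])
     || Node <- ComputingNodes],
    ok.

%% Stop Percept2 everywhere once the simulation ends, leaving one
%% trace file per computing node for later analysis and visualisation.
on_simulation_stop(ComputingNodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.
```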

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for approximately 5 seconds of the simulation execution.
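The modified behaviour can be sketched in the same hypothetical style: sleep for 10 seconds after the simulation starts, profile for 5 seconds, then stop Percept2 on all computing nodes.

```erlang
%% Hypothetical sketch of the modified plugin: capture only a 5-second
%% profiling window starting 10 seconds into the simulation run.
profile_window(ComputingNodes) ->
    timer:sleep(10000),                               % skip the first 10 s
    [rpc:call(N, percept2, profile, ["window.dat", [concurrency]])
     || N <- ComputingNodes],
    timer:sleep(5000),                                % profile for 5 s
    [rpc:call(N, percept2, stop_profile, []) || N <- ComputingNodes],
    ok.
```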

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances are runnable but not run in parallel by the Erlang VM as much as they could be, or whether, for example, a few complex models lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under the LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes to Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would also involve terminating the underlying virtual machine instances.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca does not know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, so that they do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case, one of two alternative solutions can be used. The first is that when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, the node should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming scheme: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
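A minimal sketch of this grouping, using SD Erlang's s_group:new_s_group/2 with invented node and group names (return values are ignored for brevity):

```erlang
%% Sketch (node and group names invented): the time manager on
%% tm1@host1 belongs both to the s_group of its parent and siblings
%% and to the s_group containing its own children.
create_tm_groups() ->
    %% Root time manager together with its immediate children.
    _ = s_group:new_s_group(tm_root_group,
                            ['root_tm@host0', 'tm1@host1', 'tm2@host2']),
    %% tm1 together with the time managers below it.
    _ = s_group:new_s_group(tm1_children_group,
                            ['tm1@host1', 'tm1a@host3', 'tm1b@host4']),
    ok.
```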


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer of the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.

• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000 ants

• Large: 1, 500, 1000, 1500, ..., 100000 ants

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for the same five Erlang/OTP versions.]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for the same five Erlang/OTP versions.]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for the same five Erlang/OTP versions.]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for the same five Erlang/OTP versions.]

Figure 51: Glasgow Xeon machines, large executions

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for the same five Erlang/OTP versions.]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1  31/01/2015  First version, submitted to internal reviewers

0.2  23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0  27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



1 Executive Summary

The stated objectives of this deliverable are to "port Sim-Diasca to the Blue Gene architecture, adding locality control as required". In the event, we have interpreted this more generally, studying two benchmarks in addition to Sim-Diasca and measuring them on five parallel architectures (Section 3.1.5), as outlined below.

The deliverable aims to study the scalability of Erlang programs, in order to be able to process larger problem sizes while making good use of available computing resources. More precisely, the overall aim here is to study:

• How Erlang programs currently scale, using for that large computing infrastructures like high-performance clusters. We intended to investigate the performance on the Blue Gene/Q supercomputer, but, as outlined in Appendix A, the corresponding port of the Erlang runtime was only partly functional, due to issues at the level of the networking back-end. Instead, we have used 5 conventional clusters.

• The extent to which the scalability of Erlang programs can be improved by adopting architectural changes and making various software choices. We compare the performance of the Erlang/OTP release that existed at the start of the project (R15B) with the version containing the RELEASE scalability improvements (17.4). We also measure the impact of using the SD Erlang version developed in the project.

To achieve these goals, the scalability of one main case study will be studied, namely the discrete-time simulation engine named Sim-Diasca (released as free software by EDF since 2010), whose purpose is to execute large simulations of complex systems (Section 2.1). For RELEASE, a full benchmarking simulation case named City-example has been devised and implemented by EDF (Section 2.2). It transpires that the Sim-Diasca City instance scales up to at least 16 hosts (256 cores) on two clusters, but exhibits poor efficiency on both (Section 4.1). We investigate the scalability issues using both standard profiling tools (Section 4.1) and the new RELEASE tools (Sections 4.2 and 4.3).

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 hosts. These are not the network connectivity issues that emerge at scales of around 60 hosts in our Riak study [GCTM13] and are established in Erlang folklore. To investigate the issues at these larger scales, we re-use the Orbit benchmark (D3.1) and develop a new Ant Colony Optimisation (ACO) benchmark. We find that SD Erlang improves the performance of both ACO and Orbit beyond 60 hosts (Sections 3.1.4 and 3.3.4). Moreover, applications running on Erlang/OTP R15B, 17.4 (Official) and 17.4 (RELEASE) all exhibit similar scaling, e.g. similar speedup and runtime curves. However, the 17.4 versions have slightly smaller runtimes than R15B on AMD architectures (Appendix B), while the converse holds on Intel Xeon machines (Section 3.3).

We demonstrate the deployment and monitoring of Sim-Diasca using our new WombatOAM tool (Section 5). Although the Sim-Diasca City instance has not reached the 60-host scale where SD Erlang techniques can help, we present a preliminary design for applying SD Erlang to it (Section 5.2).

Partner Contributions EDF created the City-example simulation case based on Sim-Diasca (Section 2), provided access to the Blue Gene/Q and the Athos cluster, and, as lead participant, coordinated the case study. Glasgow designed and measured the ACO and Orbit benchmarks (Section 3), investigated the scalable performance of the City Sim-Diasca instance using conventional tools (Section 4.1), and proposed a design for incorporating SD Erlang into the design of Sim-Diasca (Section 5.2). Kent provided the Percept2 tool, applied to Sim-Diasca by ICCS in Section 4.3. ICCS and Uppsala also applied BenchErl to Sim-Diasca (Section 4.2). Uppsala worked on the port of Erlang to the Blue Gene/Q, which is outlined in Appendix A. ESL developed a version of Sim-Diasca whose deployment relied on WombatOAM, and demonstrated its deployment and monitoring with WombatOAM (Section 5). EAB provided the Erlang/OTP releases on which we base our performance measurements.


2 The main case study

2.1 Sim-Diasca Overview

Sim-Diasca stands for Simulation of Discrete Systems of All Scales. Sim-Diasca (http://www.sim-diasca.com) is a discrete-time simulation engine designed to be applied to large-scale complex systems. This engine is developed by EDF R&D, and has been released since 2010 as free software under the GNU LGPL licence.

Simulators tend to be sizable, if not massive; typical examples are large-scale information systems, smart metering infrastructures involving millions of interacting devices, full ecosystems, the operating components of utilities (energy, waste, etc.) at the scale of entire cities, etc. As long as a target system can be logically subdivided into (potentially very numerous) parts interacting over discrete time, chances are that it can be modelled according to Sim-Diasca's conventions and then simulated by this engine.

The overall objective of the engine is to evaluate correctly the models involved in a simulation, and for that to preserve key properties, like causality, total reproducibility and some kind of "ergodicity" (a fair exploration of the possible outcomes of the simulation¹).

Preserving these properties would not be a real problem if the size of the simulated systems remained within reasonable bounds. As this is by design hardly the case for most complex systems (extrapolating their behaviour based on scale models is hazardous at best), the engine had to be designed so that it can deal with up to millions of tightly interacting model instances. Such simulations cannot be evaluated unless major efforts are spent so that they are as much as possible parallel (they can make use of all the cores of all the processors of a computer) and distributed (a set of networked computers can be used in order to collectively run that single simulation). This is often needed to keep the simulation durations (in wall-clock time) below a threshold (that could not be met if using at most one core of one processor, like many engines do) and to get access to enough memory (RAM) so that these simulations can exist at all.

Once these concurrent (i.e. parallel and distributed) operations can be properly expressed and organised, they still have to be implemented and effectively run on actual, adequate processing resources, typically HPC² clusters or supercomputers such as EDF's Blue Gene/Q.

So the central difficulty is to preserve the aforementioned properties despite massive concurrency and very significant problem sizes: scalability is surely at the heart of the Sim-Diasca use case. This is all the more a challenge as these discrete-time simulation engines are far from being embarrassingly parallel problems; one should not expect to see perfect speed-ups here, as many interleaved operations have to be finely synchronised by the engine so that all constraints are met: opening up any underlying potential concurrency thus comes at a cost.

More precisely, based on the requested simulation frequency, Sim-Diasca splits the simulated time into a series of time steps, automatically skipping the ones that can be jumped over, and reordering the inter-model messages so that properties like reproducibility are met. Causality resolution requires that time steps be further divided into as many logical moments (named diascas) as needed. During a given diasca, all model instances that have to be scheduled will then be evaluated fully concurrently, but this massive parallelism can only happen between two (lightweight) distributed synchronisations.³

¹Even in the absence of stochastic models, concurrent events allow for multiple possible "licit" trajectories of the target system.

²Meaning High Performance Computing.

This demand for scalability, combined with the need to rely on HPC resources to evaluate such larger simulations, makes the title of this deliverable, Scalable Sim-Diasca for the Blue Gene, quite self-explanatory.

2.2 City Example

2.2.1 Overview of the simulation case

The City Example simulation case has been designed to provide an open, sharable, tractable yet representative use case of Sim-Diasca for RELEASE's benchmarking purposes. Sim-Diasca is indeed a simulation engine, not a simulator; hence we need to define a simulation on top of it to create a benchmark.

The City example has been designed to be potentially arbitrarily scalable, both in terms of duration and size: there are no bounds to the duration in virtual time during which the target city can be evaluated (of course the wall-clock time will in turn reflect this), nor to its size, as this is a telescopic simulation case based on a target system (the city) that is, according to various consistency constraints, generated procedurally.

Hence the City example can be used to benchmark arbitrarily long and large simulations, reflecting the typical issues that many real-world simulations exhibit. Some examples include sequential phases becoming acute problems, new bottlenecks appearing as the scale increases, each resource showing a criticality profile, etc.

2.2.2 Description of the simulated elements

This specific simulation attempts to represent two traits of a city: the one that deals with waste management, and the one that corresponds to the weather system above it.

The waste management system. Before being simulated, an artificial city must be procedurally generated. For that, a number of waste sources (residential or industrial), incinerators and landfills are defined, and a road network (made of roads and road junctions) is generated to interconnect them.

A pool of waste trucks is then created and dispatched on the road network: they will each strive, at simulation time, to transport wastes (multiple kinds of them are defined) so that the garbage produced by the various waste sources is collected and then transformed in incinerators, resulting in bottom ash that is then to be transported farther, to landfills.

In a properly balanced system, none of the waste storage facilities will be saturated in the process, incinerators will be appropriately fed, and waste will not accumulate in the chain.

An example of a road network corresponding to such a city is represented in Figure 1.

The waste system of these cities thus includes the following elements:

- waste sources, which are either residential (they are numerous, each producing small quantities of various waste types) or industrial (there are a few of them, mostly producing large quantities of mostly other waste types);

- incinerators, each being able to burn some of these types of waste (the duration of this process depending on several factors, including which tank is used, the kind of waste, and the burners that are available for that), but producing in turn non-incinerable waste (bottom ashes);

- landfills, which are able to store all kinds of wastes (incinerable or not) but are not able to transform them;

3 These synchronisations just operate so that a consensus on the next overall virtual timestamp is established.


Figure 1 A tiny instance of generated road network

- waste trucks, which are able to transfer wastes from one point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), their limited storage and possibilities of mixing wastes, and their limited knowledge of their surroundings4;

- a road network, which allows vehicles (currently only waste trucks) to reach points of interest; this is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of vehicles on them, as shown in Figure 2). This network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS, currently not used in the course of the simulation, as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3.

This waste management system is not so trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep busy only a limited number of cores simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported5, and this showed indeed a level that,

4 These disaggregated, individual-based simulations rely only upon decentralised, partial information: for example, no agent (except, before the simulation starts, the mini-GIS) has a total knowledge of the road network (which, during the simulation, does not exist as such for scalability reasons; it is merely an implicit graph).

5 Reporting the diasca count has had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead to the list of their outgoing roads being permuted in some cases, which could in turn lead, far later in the simulations, to waste trucks making different


Figure 2 Vehicle speed based on the load of a road

Figure 3 Main classes and models of interest for the waste management system


Figure 4 Phases of a few weather cells recreating Lorenzrsquos strange attractor

once converted into a lower actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require that much processing, while the model instances maintain fairly complex states and communicate a lot, and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system. To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically thanks to a fourth-order Runge-Kutta method. It is additionally unsettled by its neighbours, as adjacent cells influence each other.
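As a concrete illustration of this numerical scheme, the following standalone sketch integrates the Lorenz system with a classical fourth-order Runge-Kutta step (a Python sketch under our own assumptions about parameter and step values; the actual weather cells are Erlang models inside Sim-Diasca, and cell-to-cell influences are omitted here):

```python
# Sketch of a weather cell's numerics: the Lorenz system integrated with a
# classical 4th-order Runge-Kutta (RK4) step. Parameter values (sigma, rho,
# beta) and the time step are illustrative assumptions.
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """Advance state' = f(state) by one RK4 step of size dt."""
    k1 = f(state)
    k2 = f(tuple(s + dt / 2 * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + dt / 2 * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# Cells starting from slightly different initial conditions follow diverging
# trajectories, which is what produces the phase-space plots of Figure 4.
state = (1.0, 1.0, 1.0)
for _ in range(1000):
    state = rk4_step(lorenz, state, 0.01)
```

Each simulated diasca, a cell would perform one such step and then exchange its new state with its neighbours.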

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5 Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control a processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and that can moreover be finely and easily tuned in terms of respective resource consumption.

Should we have had to estimate the actual resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will start by being most probably CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then be increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second processing), but will progressively scatter the interacting instances more and more across the hosts6, thus increasingly replacing local communications with networked ones and slowing down the

6 Even with a smart load balancer the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration was increasing very quickly as the scale grew, notably because of the embedded mini-GIS7, which was operating sequentially and whose load was growing exponentially with the number of spatialised instances to manage8.

Efforts were made in order to remove that GIS bottleneck and make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

- a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

- a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.
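To make the two-phase idea concrete, here is a deliberately naive sketch. The file format shown is hypothetical (not Sim-Diasca's actual format): each line declares one instance, and "@id" references may point forward. Loading first creates every instance, then resolves references, so cyclic references are harmless and entries could be processed in any order, hence largely in parallel:

```python
# Hypothetical two-phase instance loader: phase 1 creates all instances,
# phase 2 resolves "@id" references, so cycles and forward references work.
def load(lines):
    instances = {}
    pending = []
    for line in lines:
        ident, cls, *refs = line.split()
        instances[ident] = {"class": cls, "links": []}
        pending.append((ident, refs))
    for ident, refs in pending:  # second phase: every target now exists
        instances[ident]["links"] = [instances[r.lstrip("@")] for r in refs]
    return instances

city = load([
    "r1 class_Road @j1",
    "j1 class_RoadJunction @r1",  # cyclic reference back to r1
])
assert city["r1"]["links"][0] is city["j1"]
```

A single-pass loader would fail on the cycle above; the two-phase design is what removes the ordering constraint.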

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be read directly from a pre-established file. If the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even if it was largely parallelised.

The last changes that were made dealt with the integration of third-party tools with Sim-Diasca, like BenchErl and Percept2.

Distributed applications like Sim-Diasca have of course their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment on its own instead; then a simple script was written allowing to run Sim-Diasca directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

Even if an ad hoc solution for the BenchErl integration could finally be devised, not only did the deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which were the elected nodes, and notify it when each simulation phase began or finished (e.g. monitoring the

7 GIS stands for Geographic Information System.

8 The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances, otherwise the shorter roads would lead to traffic durations brief to the point of inducing, when being quantised over the simulation time step, a relative error above the default threshold allowed by the engine; Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading might not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables control of locality and reduction of connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model nodes are grouped into a number of s_groups: nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global name space; instead, every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project and can be found here: https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001-atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see https://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy.

The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that it appears that the node names do not correspond exactly to the physical structure of the cluster; see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers x1, ..., xn ∈ [0, X]. The generator functions are applied


Figure 6 SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

- It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

- It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

- It is only a few hundred lines, and has good performance and extensibility.
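Sequentially, the fixed point just described can be sketched as follows (a Python sketch; the toy generators are our own illustrative assumptions, not the benchmark's, and the real implementation is a distributed Erlang program partitioning the table across workers):

```python
# Sequential sketch of the orbit computation: apply every generator to each
# newly found vertex until no new vertex appears in the space [0, X).
def orbit(generators, x0, space):
    seen = {x0}
    frontier = [x0]
    while frontier:
        nxt = []
        for x in frontier:
            for g in generators:
                y = g(x) % space  # keep results inside the space
                if y not in seen:
                    seen.add(y)
                    nxt.append(y)
        frontier = nxt
    return seen

# Toy generators (assumptions, not the generators used in the experiments):
gs = [lambda x: 2 * x + 1, lambda x: 3 * x + 2]
result = orbit(gs, 1, 1000)
```

In the distributed versions, the `seen` set becomes a hash table partitioned over the worker processes, which is why termination is no longer locally observable and a dedicated detection algorithm is needed.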

In this section we introduce how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case up to 60 nodes), and is mainly used to check whether the script works. Here, N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and qos=release is the RELEASE project quota that allows requesting up to 256 Athos hosts.

To run the experiments we need to define parameters in the run-slurm script (Figure 7)


FROMNUMNODES is the minimum number of nodes on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7 Parameters in run-slurm

Figure 8 Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
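The resulting sequence of node counts follows directly from these three parameters; a trivial sketch (function name is ours):

```python
# Node counts for successive runs: start at from_nodes and add step until
# the allocation size (max_nodes) is exceeded.
def run_sizes(from_nodes, step, max_nodes):
    sizes = []
    n = from_nodes
    while n <= max_nodes:
        sizes.append(n)
        n += step
    return sizes

print(run_sizes(4, 3, 10))  # [4, 7, 10]
```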

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts (i.e. one Erlang node per Athos host), then we run the experiment using the timetest script, and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench:dist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find in which part of the hash table this number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process collects the credit, it can detect whether the computation has finished.
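A deliberately simplified, sequential sketch of the credit bookkeeping (here credit is always returned straight to the master; the actual algorithm of [MC98] also passes credit between worker processes):

```python
from fractions import Fraction

# Simplified sketch of credit/recovery termination detection: the computation
# starts with total credit 1; every message in flight carries part of its
# sender's credit; a process returns its credit when it becomes passive;
# termination is detected once the master has recouped credit 1.
class CreditMaster:
    def __init__(self):
        self.returned = Fraction(0)

    def recover(self, credit):
        self.returned += credit

    def terminated(self):
        return self.returned == 1

def split(credit):
    """A sender keeps half of its credit and attaches the rest to a message."""
    half = credit / 2
    return half, credit - half

master = CreditMaster()
kept, sent = split(Fraction(1))  # the root worker sends one message...
master.recover(kept)             # ...then becomes passive and returns credit
master.recover(sent)             # the receiver finishes and returns its share
assert master.terminated()
```

Exact rational arithmetic (or an equivalent encoding) matters here: with floats, repeated halving would eventually lose credit and termination could never be detected.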

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/


Figure 9 D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

- The Orbit generator is bench:g12345/1.

- We run experiments for the following initial Orbit space sizes: 2×10^6, 3×10^6, 4×10^6, and 5×10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size of 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, and repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into a set of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10 Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.
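With these formulas, the reduction in connectivity can be checked with a small calculation (a sketch; the node counts chosen are illustrative):

```python
# Per-node TCP connection counts, following the formulas above: in a cluster
# of N nodes a D-Orbit worker keeps N - 1 connections; in SD-Orbit a worker
# keeps M - 1 (its s_group peers) and a sub-master M - 1 + (N - 1)/M
# (its local workers, the other sub-masters, and the master node).
def d_orbit_worker(n):
    return n - 1

def sd_orbit_worker(m):
    return m - 1

def sd_orbit_submaster(n, m):
    return (m - 1) + (n - 1) // m

# Example: 1 master + 10 worker s_groups of 11 nodes each (N = 111, M = 11).
print(d_orbit_worker(111))          # 110 connections per worker, no s_groups
print(sd_orbit_worker(11))          # 10 connections per worker with s_groups
print(sd_orbit_submaster(111, 11))  # 20 connections per sub-master
```

The total number of connections in the system thus drops from quadratic in N to roughly linear, at the price of routing inter-group messages through sub-masters.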

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code

Parameters. In addition to the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

- Sub-master nodes are on separate Athos hosts from worker nodes.

- Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that in this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No  Name      Location  Hosts  Cores/host  Total cores  Max cores  Wait time  Processor                 RAM   Distributed Erlang Port
1   GPG       GLA       20     16          320          320        0          Xeon E5-2640 v2, 2GHz           Yes
2   TinTin    Uppsala   160    16          2560         -                                                     Yes
3   Kalkyl    Uppsala          8                                   varies                                     Yes
4   Athos     EDF       776    24          18624        6144       varies     Xeon E5-2697 v2, 2.7GHz   64GB  Yes
5   Zumbrota  EDF       4096   16          65536                   17hrs      Blue Gene/Q (PowerPC A2)        No

Table 1 Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading. However, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and then human involvement is required to restart the hosts. The way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, an optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11 D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12 D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13 SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14 D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15 D-Orbit and SD-Orbit Performance on Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found here: https://github.com/release-project/benchmarks/tree/master/ACO

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started, the modifications to P serving to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.
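The generation loop just described can be sketched as follows (a schematic Python sketch, not the project's Erlang code; function names, the evaporation parameter and the cost-function arguments are our own):

```python
import random

# Schematic single-colony ACO step for SMTWTP: P[i][j] gives the desirability
# of putting job i at position j; each ant builds a schedule guided by P, and
# the best schedule of a generation reinforces P.
def build_schedule(P, n, rng):
    """One ant: pick a job for each position, weighted by pheromone."""
    free = list(range(n))
    schedule = []
    for pos in range(n):
        weights = [P[j][pos] for j in free]
        job = rng.choices(free, weights=weights)[0]
        free.remove(job)
        schedule.append(job)
    return schedule

def update_pheromone(P, best, rho=0.1):
    """Evaporate everywhere, then reinforce the entries of the best schedule."""
    n = len(P)
    for i in range(n):
        for j in range(n):
            P[i][j] *= (1 - rho)
    for pos, job in enumerate(best):
        P[job][pos] += rho

def cost(schedule, weights, lengths, due):
    """Total weighted tardiness of a schedule."""
    t, total = 0, 0
    for job in schedule:
        t += lengths[job]
        total += weights[job] * max(0, t - due[job])
    return total
```

A generation would build several schedules, keep the one minimising cost, and feed it to update_pheromone before starting the next generation.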

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report

ICT-287510 (RELEASE) 23rd December 2015 25

[Diagram omitted: the master process communicates with colony nodes Node 1 … Node N_C, each of which runs ant processes 1 … N_A.]

Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows.

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with N_C nodes. The master process spawns N_C colony processes on the available nodes. In the next step, each colony process spawns N_A ant processes on the local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional. There are I_M communications between the master process and a colony process; also, I_A bidirectional communications take place between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO, the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L + 1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X in the following


[Diagram omitted: the master process sits at level 0, sub-master nodes occupy levels 1 to N−1, and level N contains only colony nodes. The legend distinguishes processes, nodes and groups of nodes.]

Figure 17: Node Placement in Multi-Level Distributed ACO


Figure 18: Process Placement in Multi-Level ACO


inequality: 1 + P + P^2 + P^3 + … + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150) and only 131 nodes out of 150 can be used.
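This relation can be computed directly; the small Erlang sketch below (function names are ours, not Sim-Diasca's or the ACO application's) finds the maximum X and the number of usable nodes. For P = 5 and N = 150 it yields {3, 131}, matching the example in the text.

```erlang
-module(aco_tree_sketch).
-export([levels/2]).

%% Maximum number of levels X, and nodes actually used, such that
%% 1 + P + P^2 + ... + P^(X-2) + P^X =< N.
levels(P, N) -> levels(P, N, 1).

levels(P, N, X) ->
    case nodes_used(P, X + 1) =< N of
        true  -> levels(P, N, X + 1);
        false -> {X, nodes_used(P, X)}
    end.

%% Sub-master levels 0..X-2 plus the P^X colony nodes on the last level.
nodes_used(P, X) ->
    Inner = lists:sum([pow(P, L) || L <- lists:seq(0, X - 2)]),
    Inner + pow(P, X).

pow(_P, 0) -> 1;
pow(P, E) -> P * pow(P, E - 1).
```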

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance, using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
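The essential difference between the two reliable versions can be sketched as follows, assuming the s_group API as described in the RELEASE deliverables (function names and return values may differ between SD Erlang versions):

```erlang
%% Hedged sketch: place a sub-master and its colonies in their own
%% s_group, so connections and name registrations stay local to the
%% group rather than spanning the whole cluster.
setup_colony_group(GroupName, ColonyNodes) ->
    {ok, GroupName, _Nodes} =
        s_group:new_s_group(GroupName, [node() | ColonyNodes]),
    %% Register the sub-master under a group-local name. In GR-ACO the
    %% corresponding call would be global:register_name/2, which is
    %% visible (and synchronised) on every node in the system.
    yes = s_group:register_name(GroupName, sub_master, self()),
    ok.
```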

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes: the master, sub-masters, colonies and ants, regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13], for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Plot omitted: mean error (%) against number of colonies (1–256).]

Figure 19: Mean Error

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, then to run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91])⁹, gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation, to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, owing to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken per solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


[Plot omitted: mean execution time (s) against number of colonies (1–256).]

Figure 20: Execution time

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
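The determinisation trick can be sketched as a process serving a fixed cyclic sequence; the module and sequence below are illustrative, not those used in the experiments.

```erlang
-module(cyclic_rand_sketch).
-export([start/1, next/1]).

%% A stand-in for the random number generator: a process that serves a
%% fixed sequence of numbers, cycling back to the start when exhausted,
%% so that repeated runs make identical "random" choices.
start(Seq) when Seq =/= [] ->
    spawn(fun() -> serve(Seq, Seq) end).

serve([], Full) ->
    serve(Full, Full);
serve([X | Xs], Full) ->
    receive
        {next, From} -> From ! {number, X}, serve(Xs, Full)
    end.

%% Synchronous request for the next number in the cycle.
next(Pid) ->
    Pid ! {next, self()},
    receive {number, X} -> X end.
```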

We ran each experiment with Erlang versions R15B, OTP 17.4 and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
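This style of measurement can be sketched as follows (timer:tc/3 is standard OTP; the module, function and format shown are placeholders, not the ACO program's actual code):

```erlang
-module(aco_timing_sketch).
-export([time_run/3]).

%% timer:tc/3 returns the wall-clock time in microseconds together
%% with the result of applying Module:Function(Args).
time_run(Module, Function, Args) ->
    {MicroSecs, Result} = timer:tc(Module, Function, Args),
    io:format("execution time: ~.2f s~n", [MicroSecs / 1000000]),
    Result.
```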

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

Figure 21: R15B execution times, Athos cluster

[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

Figure 22: OTP 17.4 execution times, Athos cluster


[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


[Plot omitted: execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 24: TL-ACO execution times, Athos cluster

[Plot omitted: execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 25: ML-ACO execution times, Athos cluster


[Plot omitted: execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 26: GR-ACO execution times, Athos cluster

[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

Figure 27: R15B execution times, messages × 500


[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

Figure 28: OTP 17.4 execution times, messages × 500

[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages × 500


[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon that has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

Figure 31: OTP 17.4 execution times (2), Athos cluster

[Plot omitted: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running on Erlang/OTP R15B than on Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4 × 1000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows. To complement measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
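The plugin logic can be sketched as below. The hook names are ours, and the percept2:profile/2 and percept2:stop_profile/0 calls and their arguments are assumptions based on Percept2's interface as described in Deliverable D5.2; they should be checked against the tool's documentation.

```erlang
%% Hedged sketch of the Sim-Diasca plugin hooks: start Percept2 on
%% every computing node when the simulation starts, and stop it when
%% the simulation ends, leaving one trace file per node.
on_simulation_start(ComputingNodes) ->
    [rpc:call(Node, percept2, profile,
              ["percept2_" ++ atom_to_list(Node) ++ ".dat",
               [concurrency, message]])
     || Node <- ComputingNodes],
    ok.

on_simulation_stop(ComputingNodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.
```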

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability,

• information about messages sent and received,


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'

ICT-287510 (RELEASE) 23rd December 2015 47

(a) Execution time

(b) Speedup

Figure 42 BenchErl results running the lsquosmallrsquo scale of City-simulation with duration lsquobriefrsquo

ICT-287510 (RELEASE) 23rd December 2015 48

• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
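The timed window can be sketched as follows; the 10-second and 5-second delays are those from the text, while the Percept2 call names carry the same caveats as before (assumed, not verified against the tool's API).

```erlang
-module(percept2_window).
-export([profile_window/1]).

%% Sketch: profile only a ~5-second window, starting 10 seconds after
%% the simulation has begun, to keep the trace files manageable.
profile_window(ComputingNodes) ->
    timer:sleep(10000),                       % skip the start-up phase
    [rpc:call(N, percept2, profile,
              ["percept2_" ++ atom_to_list(N) ++ ".dat"])
     || N <- ComputingNodes],
    timer:sleep(5000),                        % capture ~5 s of execution
    [rpc:call(N, percept2, stop_profile, []) || N <- ComputingNodes],
    ok.
```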

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.

ICT-287510 (RELEASE) 23rd December 2015 49

Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licence, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct version of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes, and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations after each other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, thus they do not need to be deployed. The other configuration option we introduced is the use cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
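In effect, with the use cookies option set, the user node behaves like the following sketch built from standard distribution primitives (illustrative only, not Sim-Diasca's actual code; the module and function names are ours):

```erlang
-module(cookie_connect).
-export([connect_to_precreated_nodes/2]).

%% Instead of generating a random cookie, adopt the shared cookie that
%% all (already deployed) computing nodes use, then connect to them.
connect_to_precreated_nodes(Cookie, ComputingNodes) ->
    true = erlang:set_cookie(node(), Cookie),
    [pong = net_adm:ping(Node) || Node <- ComputingNodes],
    ok.
```

The pong pattern matches make the sketch fail fast if any pre-deployed computing node is unreachable.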

5.1.2 Deployment steps

Prerequisites Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
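The naming rule from this example can be expressed as a small function (an illustrative reconstruction; the real name computation is internal to Sim-Diasca and WombatOAM):

```erlang
-module(node_naming).
-export([computing_node_name/3]).

%% Build the node name Sim-Diasca expects for a computing node from the
%% simulation name, the user name and the host, as in the example above:
%% each word of the simulation name is capitalised, and a fixed prefix
%% and the user@host suffix are added.
computing_node_name(SimulationName, User, Host) ->
    Capitalised = string:join([capitalise(Word)
                               || Word <- string:tokens(SimulationName, "_")],
                              "_"),
    "Sim-Diasca_" ++ Capitalised ++ "-" ++ User ++ "@" ++ Host.

capitalise([First | Rest]) -> [string:to_upper(First) | Rest].
```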

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
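Since this design has not been implemented, the following is only a sketch of how the grouping could be set up with SD Erlang's s_group:new_s_group/2; the group-naming scheme and the shape of the input are our own illustrative choices:

```erlang
-module(tm_groups).
-export([create_groups/1]).

%% Input: a list of {ParentNode, ChildNodes} pairs describing the
%% time-manager tree. Each parent forms one s_group with its children,
%% so every non-root time manager ends up in two s_groups: one with its
%% parent and siblings, and one with its own children (if any).
create_groups(Tree) ->
    [{ok, _Name, _Nodes} =
         s_group:new_s_group({tm_group, Parent}, [Parent | Children])
     || {Parent, Children} <- Tree],
    ok.
```

Gateway processes for inter-group routing, as in Multi-level ACO, would then be registered within each of these s_groups.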


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (these are the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: these are the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would deadlock under many circumstances otherwise. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK, multiple compute nodes share the same IP address of their associated I/O node; thus, the port can not be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, instead one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of other nodes in the system, which is not available at the time the job is submitted. It also proved to be desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connection to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally,

mpihelper:get_world_size() returns the number of nodes in total, and

mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
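Typical usage of the helper module might look as follows (a sketch; the return value of startup/0 is not documented here, so it is ignored rather than matched):

```erlang
-module(mpi_demo).
-export([start/0]).

%% Bring up MPI-based distribution and report this node's place in the
%% MPI world. Assumes the mpi_dist driver is active and the mpihelper
%% module described above is on the code path.
start() ->
    mpihelper:startup(),                    % node becomes mpinode<Index>@<host>
    Peers = mpihelper:nodes(),              % all other Erlang nodes
    World = mpihelper:get_world_size(),     % total number of nodes
    Index = mpihelper:get_index(),          % this node's MPI index
    io:format("node ~p of ~p, peers: ~p~n", [Index, World, Peers]).
```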

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map the Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10

and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions (execution time (s) against number of ants: 1, 10, 20, 30, …, 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE))

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
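For reference, the two series of ant counts can be generated as follows (a small helper written only to make the series explicit; it is not part of the benchmark code itself):

```erlang
-module(aco_params).
-export([small/0, large/0]).

%% Small series: 1, 10, 20, 30, ..., 1000 (101 experiments).
small() -> [1 | lists:seq(10, 1000, 10)].

%% Large series: 1, 500, 1000, 1500, ..., 100000 (201 experiments).
large() -> [1 | lists:seq(500, 100000, 500)].
```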


Figure 47: EDF Xeon machines, large executions (execution time (s) against number of ants: 1, 500, 1000, 1500, …, 100000; same five Erlang/OTP versions)

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP-17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. [Plot: execution time (s) against number of ants (up to 1,000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions. [Plot: execution time (s) against number of ants (up to 100,000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot: execution time (s) against number of ants (up to 1,000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot: execution time (s) against number of ants (up to 100,000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Change Log

Version  Date        Comments

0.1      31.01.2015  First version, submitted to internal reviewers

0.2      23.03.2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27.03.2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


WombatOAM, and demonstrated its deployment and monitoring with WombatOAM (Section 5). EAB provided the Erlang/OTP releases on which we base our performance measurements.


2 The main case study

2.1 Sim-Diasca Overview

Sim-Diasca stands for Simulation of Discrete Systems of All Scales. Sim-Diasca (http://www.sim-diasca.com) is a discrete-time simulation engine designed to be applied to large-scale complex systems. This engine is developed by EDF R&D, and it has been released since 2010 as free software under the GNU LGPL licence.

Simulators tend to be sizable, if not massive: typical examples are large-scale information systems, smart metering infrastructures involving millions of interacting devices, full ecosystems, the operating components of utilities (energy, waste, etc.) at the scale of entire cities, etc. As long as a target system can be logically subdivided into (potentially very numerous) parts interacting over discrete time, chances are that it can be modelled according to Sim-Diasca's conventions and then simulated by this engine.

The overall objective of the engine is to evaluate correctly the models involved in a simulation, and for that to preserve key properties, like causality, total reproducibility and some kind of "ergodicity" (a fair exploration of the possible outcomes of the simulation¹).

Preserving these properties would not be a real problem if the size of the simulated systems remained within reasonable bounds. As this is by design hardly the case for most complex systems (as extrapolating their behaviour based on scale models is hazardous at best), the engine had to be designed so that it can deal with up to millions of tightly interacting model instances. Such simulations cannot be evaluated unless major efforts are spent so that they are as much as possible parallel (they can make use of all the cores of all the processors of a computer) and distributed (a set of networked computers can be used in order to collectively run that single simulation). This is often needed to keep the simulation durations (in wall-clock time) below a threshold (which could not be met if using at most one core of one processor, like many engines do), and to get access to enough memory (RAM) so that these simulations can exist at all.

Once these concurrent (i.e. parallel and distributed) operations can be properly expressed and organised, they still have to be implemented and effectively run on actual, adequate processing resources, typically HPC² clusters or supercomputers such as EDF's Blue Gene/Q.

So the central difficulty is to preserve the aforementioned properties despite massive concurrency and very significant problem sizes: scalability is surely at the heart of the Sim-Diasca use case. This is all the more a challenge as these discrete-time simulation engines are far from being embarrassingly parallel problems: one should not expect to see perfect speed-ups here, as many interleaved operations have to be finely synchronised by the engine so that all constraints are met; opening up any underlying potential concurrency thus comes at a cost.

More precisely, based on the requested simulation frequency, Sim-Diasca splits the simulated time into a series of time steps, automatically skipping the ones that can be jumped over, and reordering the inter-model messages so that properties like reproducibility are met. Causality resolution requires that time steps be further divided into as many logical moments (named diascas) as needed. During a given

¹Even in the absence of stochastic models, concurrent events allow for multiple possible "licit" trajectories of the target system.

²Meaning High Performance Computing.


diasca, all model instances that have to be scheduled then will be evaluated fully concurrently, but this massive parallelism can only happen between two (lightweight) distributed synchronisations³.
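As a rough illustration of this scheme, the following sketch (a toy model of our own, in Python rather than Sim-Diasca's Erlang; the instance behaviours and message routing are simplified assumptions) shows one time step subdivided into diascas, with a synchronisation point between consecutive diascas:

```python
# Schematic sketch (NOT Sim-Diasca code) of a time step split into diascas:
# within a diasca, every scheduled instance acts conceptually in parallel;
# messages sent during diasca d schedule their targets at diasca d+1.
def run_time_step(behaviours, initially_scheduled):
    """behaviours: instance_id -> function(diasca) returning ids to message."""
    diasca, scheduled = 0, set(initially_scheduled)
    trace = []
    while scheduled:
        trace.append((diasca, sorted(scheduled)))
        targets = set()
        for inst in scheduled:            # conceptually concurrent evaluation
            targets |= set(behaviours[inst](diasca))
        scheduled = targets               # lightweight synchronisation point
        diasca += 1
    return trace
```

For example, a truck messaging a road junction at diasca 0 causes the junction to be scheduled at diasca 1; the step ends once a diasca schedules no instance.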

This demand for scalability, combined with the need to rely on HPC resources to evaluate such large simulations, makes the title of this deliverable, Scalable Sim-Diasca for the BlueGene, quite self-explanatory.

2.2 City Example

2.2.1 Overview of the simulation case

The City Example simulation case has been designed to provide an open, sharable, tractable yet representative use case of Sim-Diasca for RELEASE's benchmarking purposes. Sim-Diasca is indeed a simulation engine, not a simulator; hence we need to define a simulation on top of it to create a benchmark.

The City example has been designed so that it is potentially arbitrarily scalable, both in terms of duration and size: there are no bounds to the duration in virtual time during which the target city can be evaluated (of course the wallclock time will in turn reflect this), nor to its size, as this is a telescopic simulation case based on a target system (the city) that is, according to various consistency constraints, generated procedurally.

Hence the City example can be used to benchmark arbitrarily long and large simulations, reflecting the typical issues that many real-world simulations exhibit: sequential phases becoming acute problems, new bottlenecks appearing as the scale increases, each resource showing a criticality profile, etc.

2.2.2 Description of the simulated elements

This specific simulation attempts to represent a few traits of a city, namely the part that deals with waste management and the part that corresponds to the weather system above it.

The waste management system. Before being simulated, an artificial city must be procedurally generated. For that, a number of waste sources (residential or industrial), incinerators and landfills are defined, and a road network (made of roads and road junctions) is generated to interconnect them.

A pool of waste trucks is then created and dispatched on the road network: they will each strive, at simulation-time, to transport wastes (multiple kinds of them are defined) so that the garbage produced by the various waste sources is collected and then transformed in incinerators, resulting in bottom ash that is then to be transported farther, into landfills.

In a properly balanced system, none of the waste storage facilities will be saturated in the process, incinerators will be appropriately fed, and waste will not accumulate in the chain.

An example of a road network corresponding to such a city is represented in Figure 1.

The waste system of these cities thus includes the following elements:

- waste sources, which are either residential (they are numerous, each producing small quantities of various waste types) or industrial (there are a few of them, mostly producing large quantities of mostly other waste types);

- incinerators, each being able to burn some of these types of waste (the duration of this process depending on several factors, including which tank is used, the kind of waste, and the burners that are available for that), but producing in turn non-incinerable waste (bottom ashes);

- landfills, which are able to store all kinds of wastes (incinerable or not) but are not able to transform them;

³These synchronisations just operate so that a consensus on the next overall virtual timestamp is established.


Figure 1: A tiny instance of a generated road network

- waste trucks, which are able to transfer wastes from one point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), limited storage and possibilities of mixing wastes, and limited knowledge of their surroundings⁴;

- a road network, which allows vehicles (currently only waste trucks) to reach points of interest; this is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of vehicles on them, as shown in Figure 2). This network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS, currently not used in the course of the simulation, as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3.

This waste management system is not so trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep only a limited number of cores busy simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported⁵, and this showed indeed a level that,

⁴These disaggregated, individual-based simulations rely only upon decentralised, partial information: for example, no agent (except, before the simulation starts, the mini-GIS) has a total knowledge of the road network (which, during the simulation, does not exist as such for scalability reasons: it is merely an implicit graph).

⁵Reporting the diasca count has had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations, we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead to having the list of their outgoing roads be permuted in some cases, which could lead in turn, far later in the simulations, to waste trucks making different


Figure 2: Vehicle speed based on the load of a road

Figure 3: Main classes and models of interest for the waste management system


Figure 4: Phases of a few weather cells recreating Lorenz's strange attractor

once converted into a (lower) actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot; and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system. To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically thanks to a fourth-order Runge-Kutta method. It is additionally unsettled by its neighbours, as adjacent cells influence each other.
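For illustration, the kind of computation one weather cell performs can be sketched as follows (a minimal Python sketch, not the Erlang model code; the classic Lorenz parameters sigma=10, rho=28, beta=8/3 are assumed, and the coupling with neighbouring cells is omitted):

```python
# Classic Lorenz system; sigma/rho/beta values are the textbook choice,
# assumed here rather than taken from the City-example models.
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

# One fourth-order Runge-Kutta step for an autonomous ODE system.
def rk4_step(f, state, dt):
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))
```

Iterating rk4_step(lorenz, state, dt) from slightly different initial states yields the diverging but bounded trajectories visible in Figure 4.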

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5: Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control the processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and that can moreover be finely and easily tuned in terms of respective resource consumption.

If we had to estimate the actual resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will start by being most probably CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then be increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second being processing), but will progressively lead to scattering the interacting instances more and more across the hosts⁶, thus increasingly replacing local communications with networked ones and slowing down the

⁶Even with a smart load balancer, the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation. As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration increased very quickly as the scale grew, notably because of the embedded mini-GIS⁷, which was operating sequentially and whose load was growing exponentially with the number of spatialised instances to manage⁸.

Efforts were made in order to remove that GIS bottleneck and make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.

This newer scheme allowed the actual simulations to bypass the heavy, sequential GIS computations, since their precomputed result could be read directly from a pre-established file. While the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even if it was largely made parallel.

The last changes that were made dealt with the integration of third-party tools with Sim-Diasca, like BenchErl and Percept2.

Distributed applications like Sim-Diasca have of course their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment by its own means instead; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

While an ad hoc solution for the BenchErl integration could finally be devised, not only did the deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which were the elected nodes, and notify it when each simulation phase began or finished (e.g. monitoring the

⁷GIS stands for Geographic Information System.

⁸The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances; otherwise the shorter roads would lead to traffic durations so brief as to induce, once quantised over the simulation time-step, a relative error above the default threshold allowed by the engine. Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading could not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables the control of locality and reduces connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model nodes are grouped into a number of s_groups; nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global name space; instead, every s_group has its own namespace which is shared among the group members only.
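This connectivity model can be pictured with a small sketch (plain Python, not SD Erlang's API; the group contents are hypothetical): a full mesh of connections inside each s_group, and no implicit connections across groups.

```python
# Toy illustration of s_group connectivity: transitive (full-mesh)
# connections within a group, none implied between groups. A node in
# two groups (a "gateway") links the groups without merging their meshes.
def sgroup_connections(groups):
    """groups: list of lists of node names -> set of intra-group edges."""
    edges = set()
    for members in groups:
        for a in members:
            for b in members:
                if a < b:               # one undirected edge per pair
                    edges.add((a, b))
    return edges
```

With groups [["a", "b", "c"], ["c", "d"]], node "c" belongs to both groups, so "d" reaches "c" directly but is never implicitly connected to "a" or "b"; compare this with classic distributed Erlang, where all five nodes would form one full mesh.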

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found here: https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node, and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see https://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required, and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that the node names appear not to correspond exactly to the physical structure of the cluster: see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers (x1, ..., xn) ∈ [0, X]. The generator functions are applied

ICT-287510 (RELEASE) 23rd December 2015 14

Figure 6 SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, yet has good performance and extensibility.

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.
$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.
$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here, -N is the number of Athos hosts, -c is the number of cores per node, -t is the requested time in minutes, and --qos=release is the RELEASE project quota that enables us to request up to 256 Athos hosts.
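The run-slurm script itself is not reproduced here, but a SLURM batch script of this general shape would encode the same options as directives (the directive values and the VM-launch command below are illustrative assumptions, not taken from the actual script):

```shell
#!/bin/bash
#SBATCH --nodes=256          # -N: number of Athos hosts
#SBATCH --cpus-per-task=24   # -c: hardware threads per host
#SBATCH --time=300           # -t: wall-clock limit, in minutes
#SBATCH --partition=comp
#SBATCH --qos=release        # RELEASE quota: up to 256 hosts

# Start one Erlang VM per allocated host; SLURM provides the host list.
for h in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    ssh "$h" erl -detached -name "orbit@$h" &
done
wait
```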

To run the experiments we need to define parameters in the run-slurm script (Figure 7)


FROMNUMNODES is the minimum number of nodes on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7 Parameters in run-slurm

Figure 8 Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to FROMNUMNODES=4, STEPNODES=3, and NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
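The sequence of node counts visited by such a sweep can be sketched as a small shell function (node_counts is our name for illustration; it is not part of run-slurm):

```shell
# Print the node counts used by a sweep: start at $1, step by $2,
# and do not exceed $3 -- mirroring FROMNUMNODES and STEPNODES.
node_counts() {
    n=$1; step=$2; max=$3; out=""
    while [ "$n" -le "$max" ]; do
        out="$out$n "
        n=$((n + step))
    done
    echo "${out% }"
}

node_counts 4 3 10   # the example above: prints "4 7 10"
```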

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts (i.e. one Erlang node per Athos host); then we run the experiment using the timetest script and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench:dist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table; a hash function is applied to a generated number to determine in which part of the hash table the number should be stored.
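This routing step can be sketched as follows (in Python for brevity, since the real implementation is Erlang; the hash function and worker count are illustrative assumptions):

```python
def owner(x: int, n_workers: int) -> int:
    """Map a generated number to the worker whose hash-table fragment owns it.
    (Illustrative hash; the real Orbit uses its own hash function.)"""
    return hash(x) % n_workers

# Each worker holds one fragment of the distributed hash table.
fragments = [set() for _ in range(4)]

for x in [17, 42, 99, 1234]:        # numbers produced by the generators
    fragments[owner(x, 4)].add(x)   # route x to its owning worker

# A number triggers further generator applications only if it was not
# already present in its owning fragment, i.e. it is genuinely new.
```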

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to the active processes. Therefore, when the master process collects the credit, it can detect whether the computation has finished.
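The idea behind the credit scheme can be sketched as follows (a toy Python model using exact fractions; it simplifies the algorithm so that credit flows straight back to the master rather than to other active processes):

```python
from fractions import Fraction

class Master:
    """Toy credit-based termination detector: the computation has finished
    once the whole of the initial credit has returned to the master."""
    def __init__(self):
        self.recovered = Fraction(0)

    def terminated(self) -> bool:
        return self.recovered == 1

master = Master()

def spawn_work(credit: Fraction, depth: int = 0) -> None:
    """An active process splits its credit with the work it spawns and
    returns its remaining share when it becomes passive."""
    if depth == 3:                     # leaf: no further work, go passive
        master.recovered += credit
        return
    half = credit / 2
    spawn_work(half, depth + 1)        # child gets half of the credit
    master.recovered += credit - half  # this process goes passive

spawn_work(Fraction(1))                # master hands out all of its credit
assert master.terminated()             # all credit came back: done
```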

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/


Figure 9 D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code.

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2 × 10^6, 3 × 10^6, 4 × 10^6, and 5 × 10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size of 2M elements, varying the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into s_groups. Here we have two types of s_group: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups; each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, worker nodes within an s_group communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is thus equal to the number of worker nodes in its worker s_group.


Figure 10 Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.
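To make the comparison concrete, the connection counts can be computed directly (a Python sketch; the 256-node cluster and 16-node s_groups used in the example are illustrative values, not measurements):

```python
def d_orbit_connections(n_nodes: int) -> int:
    """Fully connected distributed Erlang: each node connects to all others."""
    return n_nodes - 1

def sd_orbit_worker_connections(group_size: int) -> int:
    """An SD-Orbit worker connects only within its own s_group of M nodes."""
    return group_size - 1

def sd_orbit_submaster_connections(n_nodes: int, group_size: int) -> float:
    """A sub-master connects within its s_group and to the other sub-masters
    in the master s_group: M - 1 + (N - 1)/M."""
    return group_size - 1 + (n_nodes - 1) / group_size

# Example: N = 256 nodes, s_groups of M = 16 nodes.
assert d_orbit_connections(256) == 255          # every D-Orbit node
assert sd_orbit_worker_connections(16) == 15    # an SD-Orbit worker
```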

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, for collecting credit and data, and for forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we define the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To choose the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No | Name     | Location | Hosts | Cores/host | Total cores | Max cores | Wait time | Processor                | RAM  | Distributed Erlang port
1  | GPG      | GLA      | 20    | 16         | 320         | 320       | 0         | Xeon E5-2640 v2, 2GHz    | -    | Yes
2  | TinTin   | Uppsala  | 160   | 16         | 2560        | -         | -         | -                        | -    | Yes
3  | Kalkyl   | Uppsala  | -     | 8          | -           | -         | varies    | -                        | -    | Yes
4  | Athos    | EDF      | 776   | 24         | 18624       | 6144      | varies    | Xeon E5-2697 v2, 2.7GHz  | 64GB | Yes
5  | Zumbrota | EDF      | 4096  | 16         | 65536       | -         | 17hrs     | Blue Gene/Q (PowerPC A2) | -    | No

Table 1 Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of the Orbit, which varies from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b); again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of the Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human intervention is then required to restart the hosts. Because of the way SLURM works, a user is not informed of the reasons for such failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put approximately 157 Athos hosts out of action, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on two further clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with the results we observe on the Athos cluster.

(a) Runtime
(b) Speedup

Figure 11 D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4

(a) Runtime
(b) Speedup

Figure 12 D-Orbit Performance in SD Erlang/OTP 17.4

(a) Runtime
(b) Speedup

Figure 13 SD-Orbit Performance in SD Erlang/OTP 17.4

(a) Runtime
(b) Speedup

Figure 14 D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4

(a) Runtime
(b) Speedup

Figure 15 D-Orbit and SD-Orbit Performance on the Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods, with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
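One generation of the colony can be sketched as follows (a toy Python model: the cost function, decay and boost constants are stand-ins, and the real ants construct schedules heuristically under the guidance of P rather than purely at random):

```python
import random

def ant_solution(n: int, P) -> list:
    """One ant builds a schedule; here simply a random permutation.
    (The real ants use heuristics guided by the pheromone matrix P.)"""
    jobs = list(range(n))
    random.shuffle(jobs)
    return jobs

def cost(schedule: list) -> int:
    """Stand-in cost; the real SMTWTP cost is total weighted tardiness."""
    return sum(i * j for i, j in enumerate(schedule))

def generation(n: int, P, n_ants: int, decay: float = 0.9, boost: float = 1.0):
    """Run one generation: collect solutions, pick the best, update P."""
    solutions = [ant_solution(n, P) for _ in range(n_ants)]
    best = min(solutions, key=cost)
    # Decay all entries, then reinforce the (job, position) choices
    # made by the best solution.
    for i in range(n):
        for j in range(n):
            P[i][j] *= decay
    for pos, job in enumerate(best):
        P[job][pos] += boost
    return best

n = 5
P = [[1.0] * n for _ in range(n)]   # one entry per (job, position) pair
best = generation(n, P, n_ants=10)
```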

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of the network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report



Figure 16 Two-Level Distributed ACO

their best solutions; the globally best solution is then selected and is reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and each colony process, and IA bidirectional communications between a colony process and each ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among those of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (the level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying



Figure 17 Node Placement in Multi Level Distributed ACO


Figure 18 Process Placement in Multi Level ACO


1 + P + P^2 + P^3 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 of the 150 nodes can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely; GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
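The sub-master tree sizing used by ML-ACO above can be sketched as follows (a Python sketch; the helper names are ours, but the formula is the one given in the text):

```python
def nodes_needed(p: int, levels: int) -> int:
    """Nodes for a tree with the given number of levels: sub-master levels
    0..levels-2 (p**i nodes each) plus p**levels colony nodes."""
    return sum(p**i for i in range(levels - 1)) + p**levels

def max_levels(p: int, n: int) -> int:
    """Largest number of levels X such that 1 + P + ... + P^(X-2) + P^X <= N."""
    x = 1
    while nodes_needed(p, x + 1) <= n:
        x += 1
    return x

# The P = 5, N = 150 example from the text: 1 + 5 + 5^3 = 131 <= 150,
# while a 4-level tree would need 1 + 5 + 25 + 5^4 = 656 nodes.
assert nodes_needed(5, 3) == 131
assert max_levels(5, 150) == 3
```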

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs between the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

ICT-287510 (RELEASE) 23rd December 2015 29

[Plot: mean error (%) against number of colonies (1–256)]

Figure 19 Mean Error

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken to find a solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

[Plot: mean execution time (s) against number of colonies (1–256)]

Figure 20 Execution time

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times reported here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and that the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all of the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to

[Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO]

Figure 21 R15B execution times, Athos cluster

[Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO]

Figure 22 OTP 17.4 execution times, Athos cluster

[Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO]

Figure 23 OTP 17.4 (RELEASE version) execution times, Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version; as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than the R15B VM does.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, whereas the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,

[Plot: execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)]

Figure 24 TL-ACO execution times, Athos cluster

[Plot: execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)]

Figure 25 ML-ACO execution times, Athos cluster

[Plot: execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)]

Figure 26 GR-ACO execution times, Athos cluster

[Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO]

Figure 27 R15B execution times, messages ×500

[Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO]

Figure 28 OTP 17.4 execution times, messages ×500

[Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO]

Figure 29 OTP 17.4 (RELEASE version) execution times, messages ×500

[Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO]

Figure 30 R15B execution times (2), Athos cluster

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was:

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster; at certain points, including a new machine would mean that it was in a more “distant” region of the cluster in terms of communication,


[Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO.]

Figure 31: OTP 17.4 execution times (2), Athos cluster

[Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO.]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
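The degree of fragmentation can be quantified by expanding SLURM's compact hostlist notation and counting the contiguous runs of node indices it contains. The following sketch (Python; the helper names are ours, and the allocations are abbreviated prefixes of the two shown above) illustrates the idea:

```python
def expand_hostlist(spec):
    """Expand a SLURM bracket expression such as 'atcn[141,144,181-184]'
    into the sorted list of numeric node indices it denotes."""
    inner = spec[spec.index('[') + 1:spec.rindex(']')]
    indices = []
    for part in inner.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            indices.append(int(part))
    return sorted(indices)

def count_fragments(indices):
    """Count maximal runs of consecutive indices; more runs means a more
    scattered allocation, hence longer 'distances' between some nodes."""
    return sum(1 for i, n in enumerate(indices)
               if i == 0 or n != indices[i - 1] + 1)

# Abbreviated prefixes of the busy (Figure 32) and quiet (Figure 23) allocations:
busy = expand_hostlist('atcn[141,144,181-184,189-198,235-286,289-306]')
quiet = expand_hostlist('atcn[055-072,109-144,199-216,235-252]')
print(count_fragments(busy), count_fragments(quiet))  # → 6 4
```

Applied to the full allocations, the same count shows the busy-cluster allocation split into far more fragments than the lightly loaded weekend one.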

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (the dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running on Erlang/OTP 17.4 than on Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new 'small' scale of the City-example case, i.e. the second version of the 'small' scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurements to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1,000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), we achieve only 2.2 on 4 nodes (64 cores), and thereafter the speedup degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
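The speedup and efficiency figures quoted above follow directly from the measured runtimes; the small sketch below (Python, with runtimes approximated from Figure 34) reproduces the maximum relative speedup of 3.45:

```python
def speedup(t1, tn):
    """Speedup of an n-node run relative to the single-node runtime."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Fraction of the ideal n-fold speedup actually achieved."""
    return speedup(t1, tn) / n

# Approximate runtimes in minutes, read off Figure 34.
t1, t16 = 1000.0, 290.0
print(round(speedup(t1, t16), 2))         # → 3.45
print(round(efficiency(t1, t16, 16), 2))  # → 0.22, i.e. ~22% of ideal on 16 nodes
```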

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: using the 12 actual cores gives better performance than using the 24 (virtual) ones that the default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 available logical cores) and 14% (8.96GB) respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets, between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), with eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores of the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so that it would not capture information unrelated to the actual simulation, such as its setup) nor too late (so that it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file per computing node, which we can later analyse and visualise using Percept2.

Once we had added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we reasoned that any problems related to the implementation of Sim-Diasca that Percept2 could detect would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way, we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
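The timing logic of the modified plugin can be sketched as follows (Python for illustration only; the real plugin is Erlang code using Sim-Diasca's plugin mechanism, and the start/stop callbacks stand in for starting and stopping Percept2 on the computing nodes):

```python
import threading
import time

def windowed_profile(start_profiler, stop_profiler, delay=10.0, window=5.0):
    """Wait `delay` seconds into the run, profile for `window` seconds,
    then stop: this keeps the collected trace small enough to analyse."""
    def run():
        time.sleep(delay)
        start_profiler()
        time.sleep(window)
        stop_profiler()
    worker = threading.Thread(target=run)
    worker.start()
    return worker  # join() to wait until profiling has stopped
```

The simulation itself proceeds unimpeded while the profiling window runs in the background.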

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole simulation, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

Two WombatOAM features have been implemented since then that make it easier to deploy Sim-Diasca. The first is node name base templates: by default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base generated from the template, plus the host name.) The second feature is deployment hooks, which allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment procedure that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications to both tools follow their respective licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation has executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to make small changes to Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy the computing nodes first, the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after it, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool, Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries that make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying via a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned, but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start it by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have certain names, and the name of the simulation should be passed to the script that starts a computing node. As an example of the node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 by the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
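The naming rule can be sketched as below; the separators (underscores between the capitalised words, '@' before the host) are our reconstruction of the example above, not an authoritative specification:

```python
def simdiasca_node_name(simulation, user, host):
    """Sketch of the node-naming rule: capitalise each word of the
    simulation name and combine it with the user and host names.
    The exact separators are an assumption."""
    cased = '_'.join(word.capitalize() for word in simulation.split('_'))
    return f'Sim-Diasca_{cased}-{user}@{host}'

print(simdiasca_node_name('soda_benchmarking_test', 'myuser', '10.0.0.1'))
# → Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1
```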

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the host name or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started in the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, in which a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described, and shown to be effective for providing scalable reliability, in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
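The membership rule can be illustrated on a small manager tree; the sketch below uses plain Python dictionaries to stand in for the actual s_group calls, and the tree and group names are purely illustrative:

```python
# parent -> children; each s_group is named after the manager at its root.
tree = {
    'root': ['tm1', 'tm2'],
    'tm1': ['tm3', 'tm4'],
    'tm2': ['tm5'],
}

def s_groups(tree):
    """Each manager joins the group rooted at its parent (shared with its
    siblings) and, if it has children, the group rooted at itself."""
    parent = {child: p for p, children in tree.items() for child in children}
    groups = {}
    for manager in {'root', *parent}:
        member_of = set()
        if manager in parent:
            member_of.add(parent[manager])   # parent-and-siblings group
        if manager in tree:
            member_of.add(manager)           # own-children group
        groups[manager] = member_of
    return groups

# An inner manager such as tm1 belongs to exactly two groups: its parent's
# group and the group of its own children, as in the proposed design.
print(s_groups(tree)['tm1'])
```

Leaf managers belong to a single group, so the full mesh of node connections is replaced by overlapping small groups along the tree.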


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

Improving our knowledge of these applications and the scalability issues they experience allowed us to anticipate the removal of the next bottlenecks to be encountered, and to promote design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
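As an illustration, the following C sketch shows the kind of spin-loop wrapper such a workaround entails; the function name and error handling here are our own assumptions, not the actual code of the port.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sketch of the CNK workaround: retry read() in a spin loop
 * on a non-blocking descriptor instead of issuing a blocking call, which
 * could deadlock the node. write() would be wrapped analogously. */
static ssize_t spin_read(int fd, void *buf, size_t count) {
    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0)
            return n;                 /* data was read (or EOF when 0) */
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;                /* a real error: give up */
        /* otherwise: spin and try again */
    }
}
```

For example, reading from a descriptor that has been put into non-blocking mode with fcntl() then loops until data is available, rather than blocking inside the kernel.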

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.
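The shape of such a driver can be sketched in C as a table of function pointers; the field names below mirror the callbacks discussed in Section A.2, but the signatures are simplified assumptions, not those of the real ErlDrvEntry structure from erl_driver.h.

```c
#include <stddef.h>

/* Simplified, illustrative sketch of a driver callback table; the real
 * table in Erlang/OTP is ErlDrvEntry, with different signatures. */
typedef struct {
    int   (*init)(void);                    /* driver loaded               */
    void *(*start)(const char *port_name);  /* a new port is opened        */
    void  (*stop)(void *port);              /* a port is closed            */
    void  (*output)(void *port, const char *buf, size_t len);
                                            /* data arriving from Erlang   */
    void  (*ready_input)(void *port);       /* descriptor became readable  */
    void  (*ready_output)(void *port);      /* descriptor became writable  */
    void  (*finish)(void);                  /* driver unloaded             */
    void  (*control)(void *port, unsigned cmd,
                     const char *buf, size_t len); /* control-mode command */
} driver_entry;
```

The runtime calls through this table, so a driver replaces the networking back-end simply by supplying its own implementations of these entries.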

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
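For illustration, the name construction can be sketched in C as follows; the exact format (in particular the placement of the MPI index and the "@" separator) is an assumption about mpi_helper's behaviour, not taken from its source.

```c
#include <stdio.h>

/* Hypothetical sketch: build an Erlang-style node name of the form
 * "<basename><mpi index>@<hostname>". The exact scheme used by
 * mpi_helper:startup/1 is assumed here. */
static void make_node_name(char *out, size_t outlen,
                           const char *basename, int mpi_index,
                           const char *hostname) {
    snprintf(out, outlen, "%s%d@%s", basename, mpi_index, hostname);
}
```

Under this assumption, MPI rank 3 running on a host named cn17 would become the node mpinode3@cn17.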

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on this first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.

• Using connect, an Erlang node can connect to a remote port that is in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while the connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
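The command-byte dispatch performed by output (and, analogously, control) can be sketched as follows; the numeric command values and handler names are illustrative assumptions, not the driver's actual port protocol.

```c
#include <stddef.h>

/* Hypothetical command bytes; the real encoding used by the mpi_dist
 * port protocol is not documented here, so these values are assumed. */
enum { CMD_LISTEN = 1, CMD_ACCEPT = 2, CMD_CONNECT = 3,
       CMD_SEND = 4, CMD_RECEIVE = 5 };

/* Parse the leading command byte and relay the remainder of the buffer
 * to the matching handler; returns the command handled, or -1 on error. */
static int dispatch_output(const unsigned char *buf, size_t len) {
    if (len == 0)
        return -1;
    const unsigned char *payload = buf + 1;  /* remainder after command byte */
    size_t payload_len = len - 1;
    switch (buf[0]) {
    case CMD_LISTEN:  /* handle_listen(payload, payload_len); */  break;
    case CMD_ACCEPT:  /* handle_accept(payload, payload_len); */  break;
    case CMD_CONNECT: /* handle_connect(payload, payload_len); */ break;
    case CMD_SEND:    /* handle_send(payload, payload_len); */    break;
    case CMD_RECEIVE: /* handle_receive(payload, payload_len); */ break;
    default:
        return -1;                           /* unknown command byte */
    }
    (void)payload; (void)payload_len;        /* handlers elided in sketch */
    return buf[0];
}
```

The point of the single command byte is that one output entry point can multiplex all port operations without the runtime needing to know about them.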

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system, we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.
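The restriction to even-numbered CPUs can be sketched as an affinity bitmask; we assume here that sibling hyperthreads are numbered n and n+1, and the helper below is our own illustration (the VM itself was restricted externally, e.g. via the operating system's affinity facilities).

```c
#include <stdint.h>

/* Hypothetical sketch: build an affinity bitmask (bit n = CPU n) selecting
 * only the even-numbered processing units, approximating "one processing
 * unit per physical core" on a hyperthreaded Xeon where sibling
 * hyperthreads are typically numbered n and n+1. Such a mask is what
 * tools like taskset or sched_setaffinity would be given. Limited to the
 * first 64 CPUs for simplicity. */
static uint64_t even_cpu_mask(int ncpus) {
    uint64_t mask = 0;
    for (int cpu = 0; cpu < ncpus && cpu < 64; cpu += 2)
        mask |= (uint64_t)1 << cpu;
    return mask;
}
```

For a 24-unit Athos node this selects units 0, 2, …, 22, i.e. one hyperthread of each of the 12 physical cores.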

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



2 The main case study

2.1 Sim-Diasca Overview

Sim-Diasca stands for Simulation of Discrete Systems of All Scales. Sim-Diasca (http://www.sim-diasca.com) is a discrete-time simulation engine designed to be applied to large-scale complex systems. This engine is developed by EDF R&D, and it has been released since 2010 as free software under the GNU LGPL licence.

Simulators tend to be sizable, if not massive, and typical examples are large-scale information systems, smart metering infrastructures involving millions of interacting devices, full ecosystems, the operating components of utilities (energy, waste, etc.) at the scale of entire cities, etc. As long as a target system can be logically subdivided into (potentially very numerous) parts interacting over discrete time, chances are that it can be modelled according to Sim-Diasca's conventions and then simulated by this engine.

The overall objective of the engine is to evaluate correctly the models involved in a simulation, and for that to preserve key properties, like causality, total reproducibility, and some kind of "ergodicity" (a fair exploration of the possible outcomes of the simulation1).

Preserving these properties would not be a real problem if the size of the simulated systems remained within reasonable bounds. As this is, by design, hardly the case for most complex systems (extrapolating their behaviour based on scale models is hazardous at best), the engine had to be designed so that it can deal with up to millions of tightly interacting model instances. Such simulations cannot be evaluated unless major efforts are spent so that they are as much as possible parallel (they can make use of all the cores of all the processors of a computer) and distributed (a set of networked computers can be used in order to collectively run that single simulation). This is often needed to keep the simulation durations (in wall-clock time) below a threshold (which could not be met using at most one core of one processor, like many engines do), and to get access to enough memory (RAM) so that these simulations can exist at all.

Once these concurrent (i.e. parallel and distributed) operations can be properly expressed and organised, they still have to be implemented and effectively run on actual, adequate processing resources, typically HPC2 clusters or supercomputers such as EDF's Blue Gene/Q.

So the central difficulty is to preserve the aforementioned properties despite massive concurrency and very significant problem sizes: scalability is surely at the heart of the Sim-Diasca use case. This is all the more a challenge as these discrete-time simulation engines are far from being embarrassingly parallel problems: one should not expect perfect speed-ups here, as many interleaved operations have to be finely synchronised by the engine so that all constraints are met; opening up any underlying potential concurrency thus comes at a cost.

More precisely, based on the requested simulation frequency, Sim-Diasca splits the simulated time into a series of time steps, automatically skipping the ones that can be jumped over and reordering the inter-model messages so that properties like reproducibility are met. Causality resolution requires that time steps be further divided into as many logical moments (named diascas) as needed. During a given

1 Even in the absence of stochastic models, concurrent events allow for multiple possible "licit" trajectories of the target system.

2 Meaning High Performance Computing.


diasca, all model instances that have to be scheduled will then be evaluated fully concurrently, but this massive parallelism can only happen between two (lightweight) distributed synchronisations3.

This demand for scalability, combined with the need to rely on HPC resources to evaluate such larger simulations, makes the title of this deliverable, Scalable Sim-Diasca for the Blue Gene, quite self-explanatory.

2.2 City Example

2.2.1 Overview of the simulation case

The City Example simulation case has been designed to provide an open, sharable, tractable, yet representative use case of Sim-Diasca for RELEASE's benchmarking purposes. Sim-Diasca is indeed a simulation engine, not a simulator; hence we need to define a simulation on top of it to create a benchmark.

The City example has been designed so that it is potentially arbitrarily scalable, both in terms of duration and size: there are no bounds to the duration in virtual time during which the target city can be evaluated (of course, the wallclock time will in turn reflect this), nor to its size, as this is a telescopic simulation case based on a target system (the city) that is, according to various consistency constraints, generated procedurally.

Hence the City example can be used to benchmark arbitrarily long and large simulations, reflecting the typical issues that many real-world simulations exhibit. Some examples include: sequential phases becoming acute problems, new bottlenecks appearing as the scale increases, each resource showing a criticality profile, etc.

2.2.2 Description of the simulated elements

This specific simulation attempts to represent a few traits of a city, namely the part that deals with waste management and the part that corresponds to the weather system above it.

The waste management system. Before being simulated, an artificial city must be procedurally generated. For that, a number of waste sources (residential or industrial), incinerators, and landfills are defined, and a road network (made of roads and road junctions) is generated to interconnect them.

A pool of waste trucks is then created and dispatched on the road network: each will strive, at simulation-time, to transport wastes (multiple kinds of them are defined) so that the garbage produced by the various waste sources is collected and then transformed in incinerators, resulting in bottom ash that is then to be transported farther, to landfills.

In a properly balanced system, none of the waste storage facilities will be saturated in the process, incinerators will be appropriately fed, and waste will not accumulate in the chain.

An example of a road network corresponding to such a city is represented in Figure 1. The waste system of these cities thus includes the following elements:

- waste sources, which are either residential (they are numerous, each producing small quantities of various waste types) or industrial (there are a few of them, mostly producing large quantities of mostly other waste types);

- incinerators each being able to burn some of these types of waste (the duration of this processdepending on several factors including which tank is used the kind of waste and the burners that areavailable for that) but producing in turn non-incinerable waste (bottom ashes)

- landfills which are able to store all kinds of wastes (incinerable or not) but are not able totransform them

3 These synchronisations just operate so that a consensus on the next overall virtual timestamp is established.

ICT-287510 (RELEASE) 23rd December 2015 7

Figure 1: A tiny instance of a generated road network

- waste trucks, which are able to transfer waste from one point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), their limited storage and possibilities of mixing wastes, and their limited knowledge of their surroundings4;

- a road network, which allows vehicles (currently only waste trucks) to reach points of interest; this is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of the vehicles on them, as shown in Figure 2); this network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS, currently not used in the course of the simulation, as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3. This waste management system is not so trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep busy only a limited number of cores simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported5, and this showed indeed a level that,

4 These disaggregated, individual-based simulations rely only upon decentralised, partial information: for example, no agent (except, before the simulation starts, the mini-GIS) has a total knowledge of the road network (which, during the simulation, does not exist as such, for scalability reasons: it is merely an implicit graph).

5 Reporting the diasca count has had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations, we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead to the list of their outgoing roads being permuted in some cases, which could lead in turn, far later in the simulations, to waste trucks making different


Figure 2: Vehicle speed based on the load of a road

Figure 3: Main classes and models of interest for the waste management system


Figure 4: Phases of a few weather cells recreating Lorenz's strange attractor

once converted into a lower actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot; and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system. To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically, thanks to a fourth-order Runge-Kutta method. It is additionally unsettled by its neighbours, as adjacent cells influence each other.
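For illustration, one integration step of such a cell can be sketched as follows (a minimal Python sketch; the classic Lorenz parameters sigma=10, rho=28, beta=8/3 and the step size are assumptions, not values taken from Sim-Diasca, whose models are written in Erlang):

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations for one weather cell."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """One classic fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + dt / 2 * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + dt / 2 * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# each cell would advance its local physical state like this at every scheduling
state = (1.0, 1.0, 1.0)
for _ in range(1000):
    state = rk4_step(lorenz, state, 0.01)
```

In the actual case, the neighbour influence mentioned above would additionally perturb each cell's state between steps, via actor messages.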

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5: Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control the processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and that can moreover be finely and easily tuned in terms of respective resource consumption.

If we had to estimate the actual resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations become able to run. They will most probably start by being CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are cores available on a single computing host; the simulations will then become increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second being processing), but will progressively scatter the interacting instances more and more across the hosts6, thus increasingly replacing local communications with networked ones and slowing down the

6 Even with a smart load balancer the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes made for benchmarking

We went through various steps in order to ease the benchmarking activities, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city took long, and that this duration increased very quickly as the scale grew, notably because of the embedded mini-GIS7, which was operating sequentially and whose load was growing exponentially with the number of spatialised instances to manage8.

Efforts were made to remove that GIS bottleneck and to make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how the initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.
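The key difficulty such a loading mechanism has to solve is that instance definitions may reference each other cyclically, so creation cannot simply follow dependency order. A common solution, sketched below in Python with a hypothetical record format (not Sim-Diasca's actual file format or API), is a two-pass load: create all instances first, then resolve symbolic references:

```python
# each entry: (user_identifier, class_name, attributes); attribute values may
# contain ("ref", identifier) placeholders, possibly forming cycles
entries = [
    ("road_1", "Road", {"from": ("ref", "junction_1"), "to": ("ref", "junction_2")}),
    ("junction_1", "RoadJunction", {"outgoing": [("ref", "road_1")]}),
    ("junction_2", "RoadJunction", {"outgoing": []}),
]

def load(entries):
    # pass 1: create every instance, indexed by its user identifier
    instances = {uid: {"class": cls, **attrs} for uid, cls, attrs in entries}

    # pass 2: replace each ("ref", uid) placeholder by the instance itself
    def resolve(value):
        if isinstance(value, tuple) and len(value) == 2 and value[0] == "ref":
            return instances[value[1]]
        if isinstance(value, list):
            return [resolve(v) for v in value]
        return value

    for inst in instances.values():
        for key, value in list(inst.items()):
            inst[key] = resolve(value)
    return instances

insts = load(entries)
# the cycle road_1 -> junction_1 -> road_1 is now resolved in place
```

Since the first pass carries no dependencies, its instance creations can be performed largely in parallel, which matches the parallel loading requirement above.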

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be read directly from a pre-established file. While the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even if it was made largely parallel.

The last changes that were made dealt with the integration of third-party tools, like BenchErl and Percept2, with Sim-Diasca.

Distributed applications like Sim-Diasca have of course their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could instead take care of the deployment by itself; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

If an ad hoc solution for the BenchErl integration could finally be devised, not only did the deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which nodes were elected, and notify it when each simulation phase began or finished (e.g. monitoring the

7 GIS stands for Geographic Information System.

8 The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances, otherwise the shorter roads would lead to traffic durations so brief as to induce, once quantised over the simulation time-step, a relative error above the default threshold allowed by the engine; Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading may not be of interest for benchmarking purposes), and so that the latter could request setting updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables the programmer to control locality and reduce connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model, nodes are grouped into a number of s_groups: nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global name space; instead, every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project; it can be found at https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster, located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor, with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads), for up to 8 hours at a time.

Users interact with the cluster via a front-end node, and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below), which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users; see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that the node names appear not to correspond exactly to the physical structure of the cluster; see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalisation of a transitive closure computation [LN01]. To compute the Orbit for a given space [0,X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 in [0,X]. This creates new numbers (x1, ..., xn) in [0,X]. The generator functions are applied


Figure 6: SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.
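The fixpoint computation underlying Orbit can be sketched sequentially as follows (an illustrative Python sketch with toy generators; the actual benchmark is written in Erlang and distributes the set of seen numbers over a DHT):

```python
def orbit(generators, x0, space):
    """Compute the orbit of x0 under the generators, within [0, space)."""
    seen = {x0}
    frontier = [x0]
    while frontier:                 # apply generators until no new number appears
        x = frontier.pop()
        for g in generators:
            y = g(x) % space        # keep generated numbers inside the space
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen

# toy generators standing in for g1..gn (assumptions, not the benchmark's own)
gens = [lambda x: x + 1, lambda x: 3 * x + 1]
print(len(orbit(gens, 0, 1000)))  # → 1000: x+1 alone already reaches every element
```

The distributed versions partition `seen` across worker processes by hashing each generated number, which is why termination is no longer locally observable and a credit-based detection algorithm is needed.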

In this section we introduce how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case up to 60 nodes), and is mainly used to check whether the script works. Here, N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and qos=release is the RELEASE project quota that enables requesting up to 256 Athos hosts.

To run the experiments we need to define parameters in the run-slurm script (Figure 7)


FROMNUMNODES is the minimum number of nodes, on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; we then run the experiment using the timetest script, and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters that are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench_dist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other, and report results to the master node. Each worker process owns part of a distributed hash table; a hash function is applied to a generated number to find in which part of the hash table this number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds back. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
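The invariant behind the credit/recovery scheme is that the outstanding credit always sums to the initial amount, so the master can conclude termination exactly when all credit has come back. A minimal Python model of this idea (an illustrative simulation, not the Erlang implementation; exact fractions avoid rounding when credit recombines):

```python
from fractions import Fraction
import random

def run_until_termination(seed=42, spawn_prob=0.6, max_tasks=200):
    """Model credit/recovery termination detection: all credit starts at the
    master, is split when new work is spawned, and returns when tasks go
    passive; termination is detected when the whole credit is back."""
    rng = random.Random(seed)
    master_credit = Fraction(0)
    active = [Fraction(1)]          # credit held by each currently active task
    started = 0
    while active:
        credit = active.pop()
        started += 1
        if started < max_tasks and rng.random() < spawn_prob:
            # the task spawns two children, splitting its credit between them
            active.append(credit / 2)
            active.append(credit / 2)
        else:
            # the task becomes passive: its credit is sent back to the master
            master_credit += credit
    # the sum of all credit is invariant, so this is exactly 1 at termination
    return master_credit

print(run_until_termination())  # prints 1 (all credit recovered)
```

Regardless of how the random spawning unfolds, the returned credit is exactly 1, which is what makes the scheme a sound termination detector.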

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code.

Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

Parameters. In the experiments we discuss in Section 3.1.4, we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2*10^6, 3*10^6, 4*10^6, and 5*10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size of 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into a number of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections: within an s_group, worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in the worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N - 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M - 1) TCP connections, and a sub-master node has (M - 1 + N/M - 1) connections.

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we have chosen 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No  Name      Location  Hosts  Cores/host  Total cores  Max cores  Wait time  Processor                 RAM   Distributed Erlang port
1   GPG       GLA         20      16          320         320      0          Xeon E5-2640 v2, 2GHz     -     Yes
2   TinTin    Uppsala    160      16         2560          -       -          -                         -     Yes
3   Kalkyl    Uppsala      -       8            -          -       varies     -                         -     Yes
4   Athos     EDF        776      24        18624        6144      varies     Xeon E5-2697 v2, 2.7GHz   64GB  Yes
5   Zumbrota  EDF       4096      16        65536          -       17hrs      Blue Gene/Q (PowerPC A2)  -     No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4, depending on the size of the Orbit, which varies from 2M to 5M elements. The results show that, after reaching a peak, the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and, unlike D-Orbit, its performance does not deteriorate.

When we increase the size of the Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, after which human involvement is required to restart them. Due to the way SLURM works, a user is not immediately informed of the reasons for the failures, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran the Orbit experiments on two further clusters: GPG and Kalkyl. The results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with the results we observe on the Athos cluster.


Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4. (a) Runtime; (b) Speedup.


Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4. (a) Runtime; (b) Speedup.


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4. (a) Runtime; (b) Speedup.


Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4. (a) Runtime; (b) Speedup.


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster. (a) Runtime; (b) Speedup.


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N x N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
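One generation of this scheme can be sketched as follows (an illustrative Python sketch with a toy cost function and update rule; the actual application stores P in an ETS table and runs each ant as an Erlang process):

```python
import random

def ant_solution(n, pheromone, rng):
    """Build one schedule: at each position, pick a remaining job with
    probability proportional to its pheromone entry for that position."""
    remaining = list(range(n))
    schedule = []
    for pos in range(n):
        weights = [pheromone[job][pos] for job in remaining]
        job = rng.choices(remaining, weights=weights)[0]
        remaining.remove(job)
        schedule.append(job)
    return schedule

def one_generation(n, num_ants, pheromone, cost, rng, evap=0.9, deposit=1.0):
    """Run the ants, keep the cheapest schedule, update the pheromone matrix."""
    solutions = [ant_solution(n, pheromone, rng) for _ in range(num_ants)]
    best = min(solutions, key=cost)
    for job in range(n):                  # decrease all entries (evaporation)...
        for pos in range(n):
            pheromone[job][pos] *= evap
    for pos, job in enumerate(best):      # ...then reinforce the best schedule
        pheromone[job][pos] += deposit
    return best

# toy cost standing in for total weighted tardiness (an assumption)
cost = lambda s: sum(pos * job for pos, job in enumerate(s))
rng = random.Random(1)
P = [[1.0] * 4 for _ in range(4)]         # uniform initial pheromone matrix
best = one_generation(4, num_ants=5, pheromone=P, cost=cost, rng=rng)
```

After a generation, the entries of P along the best schedule are larger than all others, which is what biases the next generation's ants towards it.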

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes. In the next step, each colony process spawns NA ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional. There are IM communications between the master process and a colony process; also, IA bidirectional communications take place between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying the following inequality:


[Figure 17: Node Placement in Multi-Level Distributed ACO. A master process at Level 0, sub-master nodes at Levels 1 to N−1, and colony nodes at Level N (the last level contains only colony nodes).]

[Figure 18: Process Placement in Multi-Level ACO]

1 + P + P^2 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 = 131 ≤ 150), and only 131 nodes out of the 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
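The sub-master tree sizing described for ML-ACO above can be sketched as follows. This is an illustrative Erlang module of our own devising, not code from the ACO application itself:

```erlang
-module(aco_tree).
-export([levels/2, used_nodes/2]).

pow(_, 0) -> 1;
pow(P, K) when K > 0 -> P * pow(P, K - 1).

%% Total nodes needed for a tree with X levels: one master node, sub-master
%% levels holding P^1 .. P^(X-2) nodes, and P^X colony nodes on the last level,
%% i.e. 1 + P + ... + P^(X-2) + P^X.
needed(P, X) ->
    lists:sum([pow(P, K) || K <- lists:seq(0, X - 2)]) + pow(P, X).

%% The number of levels is the largest X with needed(P, X) =< N.
%% (Assumes N is large enough for at least a two-level tree.)
levels(P, N) -> levels(P, N, 2).

levels(P, N, X) ->
    case needed(P, X + 1) =< N of
        true  -> levels(P, N, X + 1);
        false -> X
    end.

%% How many of the N available nodes are actually used.
used_nodes(P, N) -> needed(P, levels(P, N)).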

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies and ants), regardless of whether they were initial processes or recovered ones.
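The core of a fault injector in this style can be approximated in a few lines. This is our own sketch, not the actual Chaos Monkey adaptation cited above: given a list of candidate process identifiers, pick one at random and send it an untrappable exit signal.

```erlang
-module(chaos).
-export([kill_random/1]).

%% Pick a random victim from Pids and send it an untrappable 'kill' signal.
%% Returns the victim's pid, so a caller can later check whether a supervisor
%% restarted the corresponding service.
kill_random(Pids) when Pids =/= [] ->
    Victim = lists:nth(rand:uniform(length(Pids)), Pids),
    exit(Victim, kill),
    Victim.
```

In the experiments described above the candidate list would cover master, sub-master, colony and ant processes alike.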

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed; for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

[Figure 19: Mean error (%) against number of colonies, from 1 to 256.]

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions. It is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

[Figure 20: Mean execution time (s) against number of colonies, from 1 to 256.]

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
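The determinism trick of replacing the random number generator with a cyclic sequence can be sketched as follows (an illustration of our own; the module and function names are not from the ACO sources):

```erlang
-module(cyclic).
-export([seed/1, next/0]).

%% Store the cyclic sequence in the process dictionary: the remaining tail
%% plus the full list, so we can wrap around when the tail is exhausted.
seed(Numbers) when Numbers =/= [] ->
    put(cyclic_numbers, {Numbers, Numbers}).

%% Drop-in replacement for a "random" draw: return the next element of the
%% seeded sequence, cycling back to the start when it runs out.
next() ->
    case get(cyclic_numbers) of
        {[N | Rest], All} -> put(cyclic_numbers, {Rest, All}), N;
        {[], All}         -> put(cyclic_numbers, {tl(All), All}), hd(All)
    end.
```

Each Erlang process that previously drew random numbers seeds its own sequence, making repeated runs bit-for-bit comparable.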

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
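This is the standard timer:tc measurement pattern; a minimal sketch, in which the timed fun is just a stand-in for the ACO main loop:

```erlang
-module(timing).
-export([time_run/1]).

%% Time a nullary fun with timer:tc/1, which returns {Microseconds, Result},
%% and convert the elapsed time to seconds for reporting.
time_run(Fun) ->
    {Micros, Result} = timer:tc(Fun),
    {Micros / 1000000, Result}.
```

Timing inside the program in this way excludes VM start-up and argument processing, which is why the plotted times cover only the simulation proper.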

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to

[Figure 21: R15B execution times, Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

[Figure 22: OTP 17.4 execution times, Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

[Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version, and, as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant, while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,

[Figure 24: TL-ACO execution times, Athos cluster, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 25: ML-ACO execution times, Athos cluster, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 26: GR-ACO execution times, Athos cluster, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 27: R15B execution times, messages ×500, for TL-ACO, ML-ACO and GR-ACO.]

[Figure 28: OTP 17.4 execution times, messages ×500, for TL-ACO, ML-ACO and GR-ACO.]

[Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

[Figure 30: R15B execution times (2), Athos cluster, for TL-ACO, ML-ACO and GR-ACO.]

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,

[Figure 31: OTP 17.4 execution times (2), Athos cluster, for TL-ACO, ML-ACO and GR-ACO.]

[Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running on Erlang/OTP 17.4 than on Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing the Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this rises only to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread binding policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
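Besides fixing the scheduler count at boot time (e.g. with the +S emulator flag alongside +sbt tnnps), the number of online schedulers can also be capped at runtime. A sketch, assuming 12 physical cores per host as on an Athos node:

```erlang
-module(sched).
-export([use_physical_cores/1]).

%% Cap the number of online schedulers to the physical core count, instead of
%% the default of one scheduler per hardware thread (hyperthreads included).
%% Returns the resulting number of online schedulers.
use_physical_cores(PhysicalCores) ->
    Configured = erlang:system_info(schedulers),   % schedulers created at boot
    Target = min(PhysicalCores, Configured),
    _Previous = erlang:system_flag(schedulers_online, Target),
    erlang:system_info(schedulers_online).
```

Note that schedulers_online can only be lowered below the boot-time scheduler count, which is why the sketch clamps the requested value.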

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster, during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows. To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond about a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and in the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

What we did, therefore, was to move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), with eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or different chips, as the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so that it would not capture information that was not related to the actual simulation, but rather, for example, to its setup) nor too late (so that it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with "tiny" scale and "brief" duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca under Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time; (b) Speedup

Figure 41: BenchErl results running the "tiny" scale of City-simulation with duration "brief"

(a) Execution time; (b) Speedup

Figure 42: BenchErl results running the "small" scale of City-simulation with duration "brief"


• information about scheduler concurrency; and

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way, we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licenses: modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency: Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing: Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
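As an illustration only, such configuration entries might look like the following Erlang terms; the option names come from the text, but the exact syntax and values expected by Sim-Diasca's deployment manager may differ:

```erlang
%% Hypothetical sketch of the simulation-case settings, not actual
%% Sim-Diasca syntax.
%% Assume the computing nodes passed in are already running; skip deployment.
{start_nodes, already_running}.
%% Use this fixed cookie (the one shared by all computing nodes) instead of
%% generating random cookies on the user node.
{use_cookies, 'shared_computing_cookie'}.
```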

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
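The naming rule can be sketched as follows; this is our reconstruction from the single example above, not the actual WombatOAM controller code:

```erlang
-module(sd_node_names).
-export([computing_node_name/3]).

%% Build the node name from the simulation name, the deploying user and the
%% host, e.g. computing_node_name("soda_benchmarking_test", "myuser",
%% "10.0.0.1") yields 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.
computing_node_name(Simulation, User, Host) ->
    Caps = [capitalise(Word) || Word <- string:tokens(Simulation, "_")],
    list_to_atom("Sim-Diasca_" ++ string:join(Caps, "_")
                 ++ "-" ++ User ++ "@" ++ Host).

capitalise([First | Rest]) -> [string:to_upper(First) | Rest].
```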

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology and the associated message routing between s_groups is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
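As an illustration, the double membership could be set up roughly as follows with SD Erlang's s_group API; the group names, node lists and return-value shapes are assumptions, not Sim-Diasca code:

```erlang
-module(tm_sgroups).
-export([setup/3]).

%% Sketch: a non-root time manager joins the s_group shared with its parent
%% and siblings (created earlier by the parent), and creates the s_group
%% that links it to its own child time managers.
setup(ParentGroup, ChildGroup, ChildNodes) ->
    {ok, _, _} = s_group:add_nodes(ParentGroup, [node()]),
    {ok, _, _} = s_group:new_s_group(ChildGroup, [node() | ChildNodes]),
    ok.
```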


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to follow regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.
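For reference, a distribution module plugged in this way must export a small fixed set of callbacks, following ERTS's documented alternative-carrier interface for the Erlang distribution; the skeleton below only names these callbacks, with placeholder bodies rather than the RELEASE implementation:

```erlang
-module(skel_dist).
-export([listen/1, accept/1, accept_connection/5,
         setup/5, close/1, select/1]).

%% Decide whether this back-end is willing to handle the given node name.
select(_Node) -> true.

%% Create the local endpoint that incoming connections will reach.
listen(_Name) -> {error, not_implemented}.

%% Spawn the process that waits for incoming connections.
accept(_Listen) -> erlang:error(not_implemented).

%% Complete the handshake for one incoming connection.
accept_connection(_AcceptPid, _Conn, _MyNode, _Allowed, _SetupTime) ->
    erlang:error(not_implemented).

%% Actively connect to a remote node.
setup(_Node, _Type, _MyNode, _LongOrShortNames, _SetupTime) ->
    erlang:error(not_implemented).

%% Shut down the endpoint.
close(_Listen) -> ok.
```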

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
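A hypothetical session using these helpers on a compute node might look like this; only the function names come from the text, while the node-name comment and formatting are illustrative:

```erlang
%% Bring up MPI-based distribution and report this node's place in the job.
init_distribution() ->
    mpihelper:startup(),                 % names this node mpinode<Index>@<Host>
    World = mpihelper:get_world_size(),  % total number of Erlang nodes
    Index = mpihelper:get_index(),       % this node's MPI rank
    Peers = mpihelper:nodes(),           % every other node in the job
    io:format("node ~b of ~b, peers: ~p~n", [Index, World, Peers]).
```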

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,[10] and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

[10] For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: mean execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000 ants.

• Large: 1, 500, 1000, 1500, …, 100000 ants.

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: mean execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: mean execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: mean execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.
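The measurement procedure (five runs per ant count, reporting the mean) can be sketched as follows, where aco:run/1 is a placeholder for the benchmark's actual entry point:

```erlang
%% Mean wall-clock time, in seconds, of 5 runs with the given ant count.
mean_time(AntCount) ->
    Micros = [element(1, timer:tc(fun() -> aco:run(AntCount) end))
              || _ <- lists:seq(1, 5)],
    lists:sum(Micros) / (5 * 1.0e6).
```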

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: mean execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: mean execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions (execution time in seconds against the number of ants, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE))


Figure 54: Heriot-Watt AMD machine, large executions (execution time in seconds against the number of ants, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE))

Change Log

Version  Date        Comments

0.1      31/01/2015  First version, submitted to internal reviewers

0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


diasca, all model instances that have to be scheduled will then be evaluated fully concurrently, but this massive parallelism can only happen between two (lightweight) distributed synchronisations.³

This demand for scalability, combined with the need to rely on HPC resources to evaluate such larger simulations, makes the title of this deliverable, Scalable Sim-Diasca for the Blue Gene, quite self-explanatory.

2.2 City Example

2.2.1 Overview of the simulation case

The City Example simulation case has been designed to provide an open, sharable, tractable, yet representative use case of Sim-Diasca for RELEASE's benchmarking purposes. Sim-Diasca is indeed a simulation engine, not a simulator; hence we need to define a simulation on top of it to create a benchmark.

The City example has been designed so that it is potentially arbitrarily scalable, both in terms of duration and size: there are no bounds to the duration in virtual time during which the target city can be evaluated (of course the wallclock time will in turn reflect this), nor to its size, as this is a telescopic simulation case based on a target system (the city) that is, according to various consistency constraints, generated procedurally.

Hence the City example can be used to benchmark arbitrarily long and large simulations, reflecting the typical issues that many real-world simulations exhibit. Some examples include sequential phases becoming acute problems, new bottlenecks appearing as the scale increases, each resource showing a criticality profile, etc.

2.2.2 Description of the simulated elements

This specific simulation attempts to represent a few traits of a city, i.e. the one that deals with waste management and the one that corresponds to the weather system above it.

The waste management system. Before being simulated, an artificial city must be procedurally generated. For that, a number of waste sources (residential or industrial), incinerators and landfills are defined, and a road network (made of roads and road junctions) is generated to interconnect them.

A pool of waste trucks is then created and dispatched on the road network: each of them will strive, at simulation-time, to transport waste (multiple kinds of which are defined) so that the garbage produced by the various waste sources is collected and then transformed in incinerators, resulting in bottom ash that is then to be transported further, to landfills.

In a properly balanced system, none of the waste storage facilities will be saturated; in the process, incinerators will be appropriately fed and waste will not accumulate in the chain.

An example of a road network corresponding to such a city is represented in Figure 1. The waste system of these cities thus includes the following elements:

- waste sources, which are either residential (they are numerous, each producing small quantities of various waste types) or industrial (there are a few of them, mostly producing large quantities of other waste types);

- incinerators, each being able to burn some of these types of waste (the duration of this process depending on several factors, including which tank is used, the kind of waste, and the burners that are available for that), but producing in turn non-incinerable waste (bottom ashes);

- landfills, which are able to store all kinds of waste (incinerable or not) but are not able to transform them;

³ These synchronisations just operate so that a consensus on the next overall virtual timestamp is established.


Figure 1: A tiny instance of a generated road network

- waste trucks, which are able to transfer waste from one point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), their limited storage and possibilities of mixing wastes, and their limited knowledge of their surroundings;⁴

- a road network, which allows vehicles (currently only waste trucks) to reach points of interest; this is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of vehicles on them, as shown in Figure 2). This network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS, currently not used in the course of the simulation, as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3. This waste management system is not so trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep only a limited number of cores busy simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported,⁵ and this showed indeed a level that,

⁴ These disaggregated, individual-based simulations rely only upon decentralised, partial information: for example, no agent (except, before the simulation starts, the mini-GIS) has total knowledge of the road network (which, during the simulation, does not exist as such for scalability reasons; it is merely an implicit graph).

⁵ Reporting the diasca count has had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead, in some cases, to the list of their outgoing roads being permuted, which could in turn lead, far later in the simulations, to waste trucks making different


Figure 2: Vehicle speed based on the load of a road

Figure 3: Main classes and models of interest for the waste management system


Figure 4: Phases of a few weather cells recreating Lorenz's strange attractor

once converted into a lower actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot; and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system. To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically, thanks to a fourth-order Runge-Kutta method. It is additionally perturbed by its neighbours, as adjacent cells influence each other.
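As an illustration of this numerical scheme, the following is a minimal sketch, written in Python for brevity (the actual models are Erlang code inside Sim-Diasca), of a classical fourth-order Runge-Kutta step applied to the Lorenz system. The parameter values (sigma = 10, rho = 28, beta = 8/3) are the textbook ones and are an assumption, as the deliverable does not state which values the models use.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (assumed classical parameters)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """One fourth-order Runge-Kutta step of size dt over a tuple state."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# Each weather cell would advance its local state once per scheduling step:
state = (1.0, 1.0, 1.0)
for _ in range(1000):
    state = rk4_step(lorenz, state, 0.01)
```

In the engine, the neighbour influence mentioned above would additionally adjust each cell's state between such solver steps.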

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5: Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control the processing demand independently of the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and that can moreover be finely and easily tuned in terms of respective resource consumption.

Should we try to figure out the actual resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will start by most probably being CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then be increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second processing), but will progressively scatter the interacting instances more and more across the hosts,⁶ thus increasingly replacing local communications with networked ones and slowing down the

⁶ Even with a smart load balancer the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes made for benchmarking

We went through various steps in order to ease the benchmarking activities, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration increased very quickly as the scale grew, notably because of the embedded mini-GIS,⁷ which was operating sequentially and whose load grew exponentially with the number of spatialised instances to manage.⁸

Efforts were made to remove that GIS bottleneck and to make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before the evaluation of the simulation itself could start).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how the initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be read directly from a pre-established file. Even though the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even if it was largely made parallel.

The last changes that were made dealt with the integration of third-party tools, like BenchErl and Percept2, with Sim-Diasca.

Distributed applications like Sim-Diasca of course have their own deployment services (often with application-specific logic for the selection of hosts; node creation, naming and set-up; the creation and deployment of a case-specific archive with relevant code and data; etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could instead take care of the deployment on its own; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

While an ad hoc solution for the BenchErl integration could finally be devised, not only did deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which nodes were elected, and notify it when each simulation phase began or finished (e.g. monitoring the

⁷ GIS stands for Geographic Information System.

⁸ The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances; otherwise the shorter roads would lead to traffic durations so brief as to induce, when quantised over the simulation time-step, a relative error above the default threshold allowed by the engine. Sim-Diasca would then detect this violation at runtime and stop the simulation with an error.


initial loading might not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings are discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang, we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which makes it possible to control locality and reduce connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model, nodes are grouped into a number of s_groups; nodes have transitive connections with nodes from the same s_group and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration: in this model there is no global name space, but every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found at https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below), which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.
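Bracketed allocation strings of this kind can be expanded mechanically (SLURM itself does this with `scontrol show hostnames`). The following sketch, written in Python purely for illustration, handles the simple single-bracket case:

```python
import re

def expand_hostlist(spec):
    """Expand a SLURM-style hostlist such as "atcn[127-129,163]" into
    individual host names (simplified: a single bracket group only)."""
    m = re.fullmatch(r"([a-z]+)\[([\d,-]+)\]", spec)
    if not m:
        return [spec]                     # already a plain host name
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        for i in range(int(lo), int(hi or lo) + 1):
            hosts.append("%s%03d" % (prefix, i))
    return hosts

# expand_hostlist("atcn[127-129,163]")
# -> ['atcn127', 'atcn128', 'atcn129', 'atcn163']
```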

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that it appears that the node names do not correspond exactly to the physical structure of the cluster: see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalisation of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers (x1, ..., xn) ∈ [0, X]. The generator functions are applied


Figure 6: SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14], which uses replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.
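To make the computation concrete, the following sequential sketch (in Python for brevity; the benchmark itself is written in Erlang) shows the fixpoint at the heart of Orbit: the generators are repeatedly applied to newly discovered vertices until no new vertex appears. The generators and space size below are toy values, not those of the benchmark.

```python
def orbit(generators, x0, space):
    """Sequential sketch of the Orbit fixpoint: apply every generator
    to each newly found vertex until no new vertex appears."""
    seen = {x0}
    frontier = [x0]
    while frontier:
        next_frontier = []
        for x in frontier:
            for g in generators:
                y = g(x) % space       # keep results inside [0, space)
                if y not in seen:      # in D-Orbit, hash(y) picks the owning worker
                    seen.add(y)
                    next_frontier.append(y)
        frontier = next_frontier
    return seen

# Two toy generators on the space [0, 1000):
gens = [lambda x: 3 * x + 1, lambda x: 5 * x + 2]
result = orbit(gens, 1, 1000)
```

The distributed versions described below parallelise exactly this loop, with the `seen` set partitioned across worker processes.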

In this section we introduce how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here -N is the number of Athos hosts, -c is the number of cores per node, -t is the requested time in minutes, and --qos=release is the RELEASE project quota that allows up to 256 Athos hosts to be requested.

To run the experiments, we need to define parameters in the run-slurm script (Figure 7).


FROMNUMNODES  the minimum number of nodes on which we run the experiment in the first run

STEPNODES     the step by which we increase the number of nodes in the subsequent runs

NUMREPEAT     the number of times each experiment will run

Figure 7: Parameters in run-slurm

Figure 8: Communication model in distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters as follows: $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
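The progression of node counts follows directly from these parameters; a trivial helper (illustrative Python, not part of run-slurm) makes the arithmetic explicit:

```python
def run_sizes(from_nodes, step, max_nodes):
    """Node counts used across successive runs, mirroring the
    FROMNUMNODES / STEPNODES parameters described above."""
    return list(range(from_nodes, max_nodes + 1, step))

# With FROMNUMNODES=4, STEPNODES=3 and a 10-node allocation:
# run_sizes(4, 3, 10) -> [4, 7, 10]
```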

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts (i.e. one Erlang node per Athos host); we then run the experiment using the timetest script, and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters that are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench_dist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table; a hash function is applied to a generated number to find in which part of the hash table this number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit and, when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process collects the credit, it can detect whether the computation has finished.
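The idea can be sketched as follows (in Python for brevity, and simplified so that finished tasks return their credit share directly to the master; the benchmark's Erlang implementation follows [MC98] and differs in detail):

```python
from fractions import Fraction

class Master:
    """Credit-based termination detection (simplified from [MC98]):
    the master hands out a total credit of 1; every task carries a
    share; termination is detected once all credit has come back."""
    def __init__(self):
        self.returned = Fraction(0)

    def collect(self, credit):
        """A finishing task returns its credit share to the master."""
        self.returned += credit
        return self.returned == 1   # True exactly when all work is done

def split(credit, n):
    """Spawning n subtasks splits the parent's credit among them."""
    return [credit / n] * n

master = Master()
tasks = [Fraction(1)]   # the initial task holds all the credit
spawns_left = 10        # bound the amount of simulated work
finished = False
while tasks:
    credit = tasks.pop()
    if spawns_left > 0:             # an "active" task spawns more work
        spawns_left -= 1
        tasks.extend(split(credit, 2))
    else:                           # a "passive" task returns its credit
        finished = master.collect(credit)
```

Exact rational arithmetic (`Fraction`) makes the "all credit returned" test exact, which is why the scheme never falsely declares termination while work is still outstanding.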

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code.

Figure 9: D-Orbit performance depending on the number of worker processes

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2 × 10⁶, 3 × 10⁶, 4 × 10⁶, and 5 × 10⁶ elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size of 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, worker nodes within an s_group communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group the communication is done via sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total of N nodes, a worker node in distributed Erlang Orbit has (N - 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M - 1) TCP connections and a sub-master node has (M - 1 + (N - 1)/M) connections.
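As a quick sanity check, the connection counts above can be written as the following helper functions (our own, not part of the Orbit code; integer division approximates the number of other sub-masters):

```erlang
%% Per-node TCP connection counts, following the formulas above.
%% N = total nodes in the cluster; M = nodes per worker s_group.
dorbit_worker_conns(N)        -> N - 1.
sdorbit_worker_conns(M)       -> M - 1.
sdorbit_submaster_conns(N, M) -> (M - 1) + (N - 1) div M.
```

For example, with N = 110 and M = 11 a distributed Erlang worker keeps 109 connections, while an SD-Orbit worker keeps only 10.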

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.
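A gateway process can be sketched as a simple relay loop. This is an assumption-laden illustration (the real SD-Orbit gateways handle routing and batching); the message shapes are our own:

```erlang
%% Sketch of a gateway process: it relays messages between a worker
%% in its own s_group and a peer gateway in another s_group.
gateway_loop() ->
    receive
        {forward, RemoteGateway, Msg} ->
            RemoteGateway ! {deliver, Msg},   % cross-s_group hop
            gateway_loop();
        {deliver, {To, Msg}} ->
            To ! Msg,                          % final hop inside own s_group
            gateway_loop();
        stop ->
            ok
    end.
```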

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. In addition to the parameters defined in Section 3.1.2, for SD-Orbit we define the following:

• Sub-master nodes are placed on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No | Name | Location | Hosts | Cores per host | Total cores | Max cores | Wait time | Processor | RAM | Distributed Erlang Port
1 | GPG | GLA | 20 | 16 | 320 | 320 | 0 | Xeon E5-2640 v2, 2GHz | 64GB | Yes
2 | TinTin | Uppsala | 160 | 16 | 2560 | - | - | - | - | Yes
3 | Kalkyl | Uppsala | - | 8 | - | - | varies | - | - | Yes
4 | Athos | EDF | 776 | 24 | 18624 | 6144 | varies | Xeon E5-2697 v2, 2.7GHz | 64GB | Yes
5 | Zumbrota | EDF | 4096 | 16 | 65536 | - | 17hrs | Blue Gene/Q (PowerPC A2) | - | No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4, and for each experiment we plot the standard deviation. Every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which varies from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human intervention is then required to restart the hosts. Because of the way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on two further clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with those we observe on the Athos cluster.


Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4: (a) Runtime; (b) Speedup


Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4: (a) Runtime; (b) Speedup


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4: (a) Runtime; (b) Speedup


Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4: (a) Runtime; (b) Speedup


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster: (a) Runtime; (b) Speedup


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N x N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
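The ETS representation described above can be sketched as follows. The function names and the particular reinforcement rule are illustrative assumptions, not the actual SMP-ACO code:

```erlang
%% Sketch of the pheromone matrix as an ETS table: one entry per row,
%% each row an N-tuple of floats. Ants only read; the master writes.
new_pheromone(N) ->
    T = ets:new(pheromone, [set, public, {read_concurrency, true}]),
    Row = list_to_tuple(lists:duplicate(N, 1.0)),
    [ets:insert(T, {I, Row}) || I <- lists:seq(1, N)],
    T.

%% An ant reads the desirability of scheduling job I at position J.
tau(T, I, J) ->
    [{I, Row}] = ets:lookup(T, I),
    element(J, Row).

%% The master reinforces entry (I, J) after the best solution is chosen.
reinforce(T, I, J, Delta) ->
    [{I, Row}] = ets:lookup(T, I),
    NewRow = setelement(J, Row, element(J, Row) + Delta),
    ets:insert(T, {I, NewRow}).
```

Storing whole rows as tuples means an ant fetches a row with a single lookup, which suits the read-mostly access pattern the text describes.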

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of the network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO (a master process and NC colony nodes, each colony node running NA ant processes)

their best solutions; the globally-best solution is then selected and reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO): There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, objects and their corresponding captions have the same colour. As the arrows show, communication between the master process and the colonies is bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO): In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L + 1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying


Figure 17: Node Placement in Multi-Level Distributed ACO (the master process at level 0, sub-master nodes at levels 1 to N-1, and colony nodes at level N)


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X-2) + P^X <= N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 = 131 <= 150) and only 131 of the 150 nodes can be used.

• Globally Reliable ACO (GR-ACO): This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO): This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
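The ML-ACO tree-sizing rule described above can be sketched as follows (our own helper functions, not part of the ACO code):

```erlang
%% Node count of an X-level tree with P processes per sub-master node:
%% 1 + P + P^2 + ... + P^(X-2) sub-master nodes plus P^X colony nodes.
tree_nodes(P, X) ->
    SubMasters = lists:sum([pow(P, L) || L <- lists:seq(0, X - 2)]),
    SubMasters + pow(P, X).

%% Largest X (>= 2) whose tree fits into N nodes, together with the
%% number of nodes actually used.
levels(P, N) ->
    levels(P, N, 2).
levels(P, N, X) ->
    case tree_nodes(P, X + 1) > N of
        true  -> {X, tree_nodes(P, X)};
        false -> levels(P, N, X + 1)
    end.

pow(_, 0) -> 1;
pow(B, E) -> B * pow(B, E - 1).

%% levels(5, 150) returns {3, 131}: a 3-level tree using 131 of 150 nodes.
```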

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.2.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution, and fail to ever arrive at the global optimum and terminate.


Figure 19: Mean Error (mean error (%) against the number of colonies, 1 to 256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Execution time (mean execution time (s) against the number of colonies, 1 to 256)

removed non-determinacy by replacing the random number generator with a function which returns a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but it is typically only about 2-3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
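The determinising trick mentioned above can be sketched as a process that cycles through a fixed list of numbers, so that repeated runs make identical "random" choices. The names and protocol here are our own, not those of the ACO code:

```erlang
%% Sketch: a deterministic stand-in for the random number generator.
start_cyclic(Numbers) ->
    spawn(fun() -> cycle(Numbers, Numbers) end).

cycle(All, []) ->
    cycle(All, All);                 % wrap around to the start of the list
cycle(All, [H | T]) ->
    receive
        {next, From} -> From ! {number, H}, cycle(All, T)
    end.

%% Callers ask the generator process for the next number in the cycle.
next(Gen) ->
    Gen ! {next, self()},
    receive {number, X} -> X end.
```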

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times reported here were measured by the ACO program itself, using Erlang's timer:tc function; they omit some overhead for argument processing at the start of execution.
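Such a measurement can be sketched as follows; timer:tc/3 returns a pair of the elapsed time in microseconds and the call's result. The aco:run/3 call and its arguments are hypothetical placeholders for the actual entry point:

```erlang
%% Sketch (our own wrapper, not the ACO code): timing a run with timer:tc/3.
time_run(Nodes, Ants, Iters) ->
    {MicroSecs, _Result} = timer:tc(aco, run, [Nodes, Ants, Iters]),
    io:format("execution time: ~p s~n", [MicroSecs / 1000000]).
```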

3.3.1 Basic results

Figures 21-23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and that the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against the number of nodes, for TL-ACO, ML-ACO, and GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24-26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see Section 3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from Section 3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see Section 3.3.4 below.

The results for the various ACO versions are shown in Figures 27-29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate phenomena similar to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21-23 in Section 3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 25: ML-ACO execution times, Athos cluster (R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))


Figure 26: GR-ACO execution times, Athos cluster (R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 27: R15B execution times, messages x 500 (TL-ACO, ML-ACO, GR-ACO)


Figure 28: OTP 17.4 execution times, messages x 500 (TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500 (TL-ACO, ML-ACO, GR-ACO, SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30-32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in Section 3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of Section 3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/sub-master nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration at all.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO, and SR-ACO: (a) Number of Sent Packets; (b) Number of Received Packets


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
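The relative speedup and efficiency figures quoted above follow directly from the measured runtimes. As a minimal illustration (the module and function names are ours, not part of Sim-Diasca; we assume a 16-node runtime of about 290 minutes, consistent with "below 300 minutes" above):

```erlang
%% Relative speedup and parallel efficiency, as used above: the baseline
%% is the runtime on a single 16-core node (~1000 minutes), so e.g.
%% relative(1000, 290) gives roughly the 3.45 quoted for 16 nodes.
-module(speedup).
-export([relative/2, efficiency/3]).

%% Speedup relative to the baseline runtime.
relative(BaselineMins, RuntimeMins) ->
    BaselineMins / RuntimeMins.

%% Efficiency: speedup divided by the factor of extra hardware used.
efficiency(BaselineMins, RuntimeMins, HardwareFactor) ->
    relative(BaselineMins, RuntimeMins) / HardwareFactor.
```

For the 16-node measurement this gives an efficiency of about 3.45/16 ≈ 0.22, quantifying the poor utilisation of the distributed hardware discussed above.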

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
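The number of online schedulers can be inspected and changed at runtime through the standard erlang:system_info/1 and erlang:system_flag/2 APIs, which is one way such measurements can be scripted; a small sketch (the module name is ours):

```erlang
%% Inspect and cap the number of online schedulers. On a 12-core Athos
%% host with hyperthreading the VM defaults to 24 schedulers; capping
%% the online count at 12 reproduces the better-performing setting.
%% (Equivalently, the VM can be started with `erl +sbt tnnps +S 24:12`.)
-module(sched_tune).
-export([online/0, set_online/1]).

%% {Configured, Online} scheduler counts.
online() ->
    {erlang:system_info(schedulers),
     erlang:system_info(schedulers_online)}.

%% Returns the previous number of online schedulers.
set_online(N) when is_integer(N), N >= 1 ->
    erlang:system_flag(schedulers_online, N).
```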

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets, between nodes in the cluster during the case study. The number of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

We therefore moved all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevented Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability,

• information about messages sent and received,


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency,

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
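The modified plugin's timing logic amounts to a simple window around the profiler calls. A hedged sketch (not the actual plugin code) is shown below; the start and stop callbacks would wrap the Percept2 invocations on the computing nodes:

```erlang
%% Profile only a short window of the run: wait DelayMs after the
%% simulation starts, profile for DurationMs, then stop. With the
%% values used above (10s delay, 5s window) the trace shrank from
%% ~16GB to ~85MB.
-module(profile_window).
-export([run/4]).

run(StartFun, StopFun, DelayMs, DurationMs) ->
    timer:sleep(DelayMs),    % let the simulation get past its setup
    StartFun(),              % e.g. start Percept2 on all computing nodes
    timer:sleep(DurationMs), % collect only DurationMs worth of events
    StopFun(),               % stop Percept2 everywhere
    ok.
```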

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.

ICT-287510 (RELEASE) 23rd December 2015 49

Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations after each other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
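To illustrate the semantics of the two entries, the following hypothetical sketch consumes them from a plain options list; the actual Sim-Diasca option handling uses its own settings structures and differs in detail:

```erlang
%% Hypothetical sketch of consuming the start_nodes and use_cookies
%% entries described above; not the real Sim-Diasca code.
-module(deploy_opts).
-export([nodes_to_launch/2, cookie/1]).

%% With start_nodes set to false, the listed computing nodes are
%% assumed to be running already, so none need to be launched.
nodes_to_launch(Opts, Nodes) ->
    case proplists:get_value(start_nodes, Opts, true) of
        true  -> Nodes;
        false -> []
    end.

%% With use_cookies set, the given cookie is used instead of a
%% randomly generated one.
cookie(Opts) ->
    proplists:get_value(use_cookies, Opts, random_cookie).
```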

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming scheme, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
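The expected node name can be derived mechanically from the simulation name, the user and the host. A sketch of that derivation follows (the module and function names are ours, for illustration only):

```erlang
%% Derive the node name Sim-Diasca expects for a computing node, e.g.
%% "soda_benchmarking_test" run by myuser on 10.0.0.1 yields
%% Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1 (see text above).
-module(node_name).
-export([for_simulation/3]).

for_simulation(SimName, User, Host) ->
    %% Capitalise each underscore-separated word of the simulation name.
    Camel = string:join([capitalize(W)
                         || W <- string:tokens(SimName, "_")], "_"),
    "Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host.

capitalize([H | T]) -> [string:to_upper(H) | T].
```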

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving the knowledge about these applications and the scalability issues they experience, we prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to follow regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users. Only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution  Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

ICT-287510 (RELEASE) 23rd December 2015 57

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpi_helper module.

The MPI Helper Module  This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
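Assuming the mpi_helper functions named above, the start-up sequence run on every node might be sketched as follows (the wrapper module, default values and return shapes are illustrative, not the actual RELEASE implementation):

```erlang
%% Illustrative sketch: bringing up MPI-based distribution on each node.
%% Assumes the mpi_helper API described above; exact signatures may differ.
-module(mpi_boot).
-export([start/0]).

start() ->
    ok    = mpi_helper:startup(),        %% uses the default base name
    Index = mpi_helper:get_index(),      %% unique MPI rank of this node
    Size  = mpi_helper:get_world_size(), %% total number of Erlang nodes
    Peers = mpi_helper:nodes(),          %% all other nodes, now connected
    io:format("node ~p of ~p, peers: ~p~n", [Index, Size, Peers]).
```

Calling start/0 on every node, with the same base name, leaves the distributed Erlang system fully connected over MPI.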

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions, respectively, send or receive data.
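The command-byte dispatch described above can be pictured with a small Erlang sketch; the byte values and handler names below are invented for illustration (the actual dispatch lives in the C driver's output function):

```erlang
%% Illustrative only: dispatching on a leading command byte, as the
%% driver's output function does. Byte values and handlers are assumptions.
-module(mpi_output_sketch).
-export([output/2]).

output(Port, <<1, Rest/binary>>) -> do_listen(Port, Rest);
output(Port, <<2, Rest/binary>>) -> do_accept(Port, Rest);
output(Port, <<3, Rest/binary>>) -> do_connect(Port, Rest);
output(Port, <<4, Rest/binary>>) -> do_send(Port, Rest);
output(Port, <<5, Rest/binary>>) -> do_receive(Port, Rest).

do_listen(_Port, _Rest)  -> ok.  %% switch the port to listening mode
do_accept(_Port, _Rest)  -> ok.  %% wait for the next incoming connection
do_connect(_Port, _Rest) -> ok.  %% connect to a remote acceptor port
do_send(_Port, _Rest)    -> ok.  %% transmit the payload via MPI
do_receive(_Port, _Rest) -> ok.  %% fetch pending MPI data
```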

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially in order to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. (Execution time in seconds against number of ants, 1, 10, 20, 30, ..., 1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).)

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.
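The two series of ant counts can be generated, for instance, as follows (a small Erlang sketch matching the sequences above):

```erlang
%% In an Erlang shell: the ant counts used in the two experiment sizes.
Small = [1 | lists:seq(10, 1000, 10)],     %% [1,10,20,30,...,1000]: 101 runs
Large = [1 | lists:seq(500, 100000, 500)]. %% [1,500,1000,...,100000]: 201 runs
```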

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. (Execution time in seconds against number of ants, 1, 500, 1000, 1500, ..., 100000, for the same five Erlang/OTP versions.)

Figure 48: EDF Xeon machines, small executions, with the erts +Muacul0 flag set.


Figure 49: EDF Xeon machines, large executions, with the erts +Muacul0 flag set.

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag in order to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions.

Figure 51: Glasgow Xeon machines, large executions.


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions.


Figure 54: Heriot-Watt AMD machine, large executions.

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Gunther Rudolph, Xin Yao, Evelyne Lutton, Juan Julian Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 1: A tiny instance of a generated road network.

- waste trucks, which are able to transfer waste from one point to another, based on their logic (state machines with a queue of intents and some opportunistic traits), their limited storage and possibilities of mixing wastes, and their limited knowledge of their surroundings4;

- a road network, which allows vehicles (currently only waste trucks) to reach points of interest. This is a directed cyclic graph whose nodes are the previous elements (e.g. an incinerator or a road junction) and whose edges are roads (with lengths and capacities, their load affecting the speed of vehicles on them, as shown in Figure 2). This network is represented twice: first as a dedicated initial graph (in an associated global road network instance in our little GIS, currently not used in the course of the simulation, as its memory footprint would quickly become overwhelming), and secondly as the superposition of the information present in each point of interest and road (at this level the information is even duplicated, as roads and points of interest both have to know each other, i.e. to know their direct connectivity).

An overall class diagram of the waste system is shown in Figure 3. This waste management system is not so trivial, as it involves a dozen classes and more than ten thousand lines of Erlang code.

While this case was very relevant to showcase how models driven by algorithms could interact (with erratic scheduling and many dynamic aspects), its level of concurrency was found insufficient in practice: even if fairly numerous model instances were created, on average at each diasca only a small subset of them could be scheduled; hence this case was able to keep only a limited number of cores busy simultaneously.

To assess this issue, a concurrency meter has been added to the engine, so that it could report the number of diascas instantiated and, for each of them, how many model instances were scheduled. An average level of theoretical concurrency could then be reported5, and this showed indeed a level that,

4 These disaggregated, individual-based simulations rely only upon decentralised, partial information: for example, no agent (except, before the simulation starts, the mini-GIS) has total knowledge of the road network (which, during the simulation, does not exist as such for scalability reasons: it is merely an implicit graph).

5 Reporting the diasca count had an interesting side-effect, as it allowed us to discover that in some cases the exact reproducibility of these simulations was lost. After some difficult investigations, we were able to exonerate the engine and find the culprit: a parallel phase of the initialisation of the road junctions could lead, in some cases, to the list of their outgoing roads being permuted, which could lead in turn, far later in the simulations, to waste trucks making different


Figure 2: Vehicle speed based on the load of a road.

Figure 3: Main classes and models of interest for the waste management system.


Figure 4: Phases of a few weather cells recreating Lorenz's strange attractor.

once converted into a lower actual concurrency, was insufficient.

The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot, and these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system  To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically thanks to a fourth-order Runge-Kutta method. It is additionally unsettled by its neighbours, as adjacent cells influence each other.
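As an illustration of what each weather cell computes, a fourth-order Runge-Kutta step for the classical Lorenz system can be sketched in Erlang as follows. The parameter values (sigma = 10, rho = 28, beta = 8/3) are the textbook ones, not necessarily those used in City-example, and the actual model additionally couples neighbouring cells:

```erlang
%% Sketch: one RK4 integration step of the classical Lorenz system.
-module(lorenz_rk4).
-export([step/2]).

%% Lorenz right-hand side, with textbook parameters.
f({X, Y, Z}) ->
    Sigma = 10.0, Rho = 28.0, Beta = 8.0 / 3.0,
    {Sigma * (Y - X),
     X * (Rho - Z) - Y,
     X * Y - Beta * Z}.

%% One RK4 step of size H from state S = {X, Y, Z}.
step(S, H) ->
    K1 = f(S),
    K2 = f(axpy(H / 2, K1, S)),
    K3 = f(axpy(H / 2, K2, S)),
    K4 = f(axpy(H, K3, S)),
    %% S + H/6 * (K1 + 2*K2 + 2*K3 + K4)
    axpy(H / 6, add(K1, add(scale(2, K2), add(scale(2, K3), K4))), S).

%% Small 3-vector helpers on {X, Y, Z} tuples.
axpy(A, {U, V, W}, {X, Y, Z}) -> {X + A * U, Y + A * V, Z + A * W}.
add({A, B, C}, {D, E, F})     -> {A + D, B + E, C + F}.
scale(A, {X, Y, Z})           -> {A * X, A * Y, A * Z}.
```

Iterating step/2 from slightly different initial states reproduces the diverging trajectories of Figure 4.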

Various cell trajectories in the phase space are shown in Figure 4.

These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5: Expected scalability profile.

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control the processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and which can moreover be finely and easily tuned in terms of respective resource consumption.

Should we try to estimate the resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) memory.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will start by being most probably CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then become increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second being processing), but will progressively lead to scattering the interacting instances more and more across the hosts6, thus increasingly replacing local communications with networked ones and slowing down the

6 Even with a smart load balancer, the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration increased very quickly as the scale grew, notably because of the embedded mini-GIS7, which operated sequentially and whose load grew exponentially with the number of spatialised instances to manage8.

Efforts were made to remove that GIS bottleneck and to make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for largely parallel processing thereof.

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be read directly from a pre-established file. While the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even if it was made largely parallel.

The last changes that were made dealt with the integration of third-party tools, like BenchErl and Percept2, with Sim-Diasca.

Distributed applications like Sim-Diasca have of course their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment on its own instead; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

While an ad hoc solution for the BenchErl integration could finally be devised, not only did the deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which nodes were elected, and notify it when each simulation phase began or finished (e.g. monitoring the

7 GIS stands for Geographic Information System.
8 The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances; otherwise, the shortest roads would lead to traffic durations so brief that, when quantised over the simulation time-step, they would induce a relative error above the default threshold allowed by the engine. Sim-Diasca would then detect this violation at runtime and stop the simulation on error.

ICT-287510 (RELEASE) 23rd December 2015 12

initial loading could not be of interest for benchmarking purposes) and so that the latter could requestsettings updates (ex requested number of schedulers for the computing nodes) to the former

To allow for such a decoupling, a plugin system was implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables control of locality and reduces connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model, nodes are grouped into a number of s_groups: nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global namespace; instead, every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found here: https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that the node names appear not to correspond exactly to the physical structure of the cluster: see [REL15, §4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers (x1, ..., xn) ∈ [0, X]. The generator functions are applied


Figure 6: SLURM allocation

on the new numbers until no new number is generated.

The following features make Orbit a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs such as Riak [Bas14], which uses replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.
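The core of the kernel is a fixpoint: generators are applied to every known vertex until no new number appears. A minimal sequential sketch of this idea (in Python, with made-up generators standing in for the benchmark's own g1, ..., gn):

```python
def orbit(generators, x0, space):
    """Compute the orbit of x0 in [0, space) under the given generators.

    Sequential sketch of the fixpoint: apply every generator to every
    known vertex until no new number is produced."""
    seen = {x0}          # the (here, local) hash table of vertices found so far
    work = [x0]          # vertices whose images have not yet been explored
    while work:
        x = work.pop()
        for g in generators:
            y = g(x) % space   # generators map back into [0, space)
            if y not in seen:
                seen.add(y)
                work.append(y)
    return seen

# Illustrative generators (not the benchmark's actual ones):
gens = [lambda x: x + 1, lambda x: 2 * x]
print(len(orbit(gens, 1, 100)))  # → 100: the whole space is reachable here
```

In the distributed versions discussed below, the `seen` set is partitioned into a distributed hash table and the work list is spread over worker processes.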

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), mainly to check whether the script works. Here -N is the number of Athos hosts, -c is the number of cores per node, -t is the requested time (in minutes), and --qos=release is the RELEASE project quota that allows us to request up to 256 Athos hosts.

To run the experiments we need to define the parameters in the run-slurm script (Figure 7).


FROMNUMNODES  is the minimum number of nodes on which we run the experiment in the first run

STEPNODES     is the step by which we increase the number of nodes in the subsequent runs

NUMREPEAT     is the number of times each experiment will run

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
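The sweep over node counts implied by these parameters can be sketched as follows (a hypothetical Python rendering of the logic that the run-slurm shell script implements; the helper name is illustrative):

```python
def node_counts(requested, from_nodes, step, repeat):
    """Enumerate the (node count, run index) pairs that would be executed:
    from_nodes, from_nodes + step, ... up to the requested allocation,
    each repeated `repeat` times."""
    runs = []
    n = from_nodes
    while n <= requested:
        for r in range(repeat):
            runs.append((n, r + 1))
        n += step
    return runs

# FROMNUMNODES=4, STEPNODES=3, NUMREPEAT=2 on a 10-node allocation:
print(node_counts(10, 4, 3, 2))
# → [(4, 1), (4, 2), (7, 1), (7, 2), (10, 1), (10, 2)]
```

This mirrors the example above: six runs in total, on 4, 7, and 10 nodes.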

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; then we run the experiment using the timetest script and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the benchdist4 function.

3.1.2 Distributed Erlang Orbit

In distributed Erlang Orbit, all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find in which part of the hash table this number should be stored.
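The table partitioning can be illustrated as follows (a sketch; the actual hash function used by D-Orbit is defined in the benchmark code):

```python
def owner(x, n_workers):
    """Map a generated number to the worker process owning the part of
    the distributed hash table in which it must be stored."""
    return hash(x) % n_workers

# Every process applies the same function, so any worker can route a
# newly generated number directly to its owner:
tables = [set() for _ in range(8)]   # e.g. 8 worker processes
for x in [17, 42, 99, 17]:           # duplicates land on the same worker
    tables[owner(x, 8)].add(x)
print(sum(len(t) for t in tables))   # → 3 distinct numbers stored
```

Because the mapping is deterministic, a number is stored (and detected as already seen) by exactly one worker, with no global coordination.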

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially, the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds back. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
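The credit-conservation idea can be modelled compactly (a toy Python sketch; the full algorithm of [MC98] also handles the asynchronous recovery of credit, which is elided here):

```python
from fractions import Fraction

class CreditMaster:
    """Toy model of credit-based termination detection: the total credit
    in the system is conserved, and the computation is finished exactly
    when the master holds all of it again."""
    def __init__(self):
        self.recovered = Fraction(0)

    def retire(self, credit):
        """A process that becomes passive returns its credit share."""
        self.recovered += credit

    def terminated(self):
        return self.recovered == 1

def split(credit, n):
    """Delegating work to n processes splits the credit among them."""
    return [credit / n] * n

master = CreditMaster()
shares = split(Fraction(1), 3)     # master delegates to 3 workers
w0_children = split(shares[0], 2)  # worker 0 spawns 2 further tasks
for c in w0_children + shares[1:]:
    master.retire(c)               # every process eventually goes passive
print(master.terminated())         # → True: all credit recovered
```

Exact fractions are used so that credit is conserved without rounding error, which is why termination can be detected as equality with the initial credit.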

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/


Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code

Parameters. In the experiments we discuss in Section 3.1.4, we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2·10^6, 3·10^6, 4·10^6, and 5·10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with the Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit, we group nodes into a set of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang, nodes have transitive connections with nodes from the same s_group and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, worker nodes within an s_group communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is thus equal to the number of other worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + N/M − 1) connections.
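For concreteness, these connection counts can be tabulated (a sketch assuming N is a multiple of M; the function names are illustrative):

```python
def d_orbit_connections(n_nodes):
    """Fully connected distributed Erlang: every node talks to all others."""
    return n_nodes - 1

def sd_orbit_worker_connections(m):
    """A worker only connects to the other nodes of its own s_group."""
    return m - 1

def sd_orbit_submaster_connections(n_nodes, m):
    """A sub-master connects to its own s_group plus the other sub-masters."""
    return (m - 1) + (n_nodes // m - 1)

# 256 nodes partitioned into s_groups of 16:
print(d_orbit_connections(256))                 # → 255
print(sd_orbit_worker_connections(16))          # → 15
print(sd_orbit_submaster_connections(256, 16))  # → 30
```

The point of the design is visible in the numbers: grouping replaces one flat O(N) connection set per node with O(M) at workers and O(M + N/M) at sub-masters.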

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of the worker processes in its worker s_group, for collecting credit and data, and for forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code

Parameters. On top of the parameters we define in Section 3.1.2, for SD-Orbit we define the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that in this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No | Name     | Location | Hosts | Cores/host | Total cores | Max cores | Wait time | Processor                | RAM  | Distributed Erlang port
---|----------|----------|-------|------------|-------------|-----------|-----------|--------------------------|------|------------------------
1  | GPG      | GLA      | 20    | 16         | 320         | 320       | 0         | Xeon E5-2640 v2, 2 GHz   |      | Yes
2  | TinTin   | Uppsala  | 160   | 16         | 2560        | -         |           |                          |      | Yes
3  | Kalkyl   | Uppsala  |       | 8          |             |           | varies    |                          |      | Yes
4  | Athos    | EDF      | 776   | 24         | 18624       | 6144      | varies    | Xeon E5-2697 v2, 2.7 GHz | 64GB | Yes
5  | Zumbrota | EDF      | 4096  | 16         | 65536       |           | 17hrs     | Blue Gene/Q (PowerPC A2) |      | No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading. However, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows, SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of the Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak, the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows and, unlike D-Orbit, its performance does not deteriorate.

When we increase the size of the Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human intervention is then required to restart the hosts. The way SLURM works, a user is not informed of the reasons for such failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put out of action approximately 157 Athos hosts and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. The results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.


Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4. (a) Runtime; (b) Speedup.


Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4. (a) Runtime; (b) Speedup.


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4. (a) Runtime; (b) Speedup.


Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4. (a) Runtime; (b) Speedup.


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster. (a) Runtime; (b) Speedup.


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found here: https://github.com/release-project/benchmarks/tree/master/ACO

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while the other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
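The per-generation cycle described above can be modelled compactly. The following is an illustrative Python sketch, not the Erlang implementation: ants become a loop rather than processes, the ETS table becomes a list of lists, and the cost function and update rule are simplified stand-ins for the SMTWTP objective and the papers' pheromone rules:

```python
import random

def construct(pher, rng):
    """One ant: choose a job for each position, with probability
    proportional to the pheromone entries P[job][pos]."""
    n = len(pher)
    jobs = list(range(n))
    schedule = []
    for pos in range(n):
        weights = [pher[j][pos] for j in jobs]
        j = rng.choices(jobs, weights)[0]
        jobs.remove(j)
        schedule.append(j)
    return schedule

def generation(pher, n_ants, cost, rng, rate=0.1):
    """Master step: run the ants, pick the cheapest schedule, then
    evaporate P and reinforce the entries of the best schedule."""
    best = min((construct(pher, rng) for _ in range(n_ants)), key=cost)
    n = len(pher)
    for j in range(n):
        for pos in range(n):
            pher[j][pos] *= (1 - rate)   # evaporation
    for pos, j in enumerate(best):
        pher[j][pos] += rate             # reinforcement of the best solution
    return best

# Toy cost function: prefer schedules close to identity order.
cost = lambda s: sum(abs(j - pos) for pos, j in enumerate(s))
rng = random.Random(42)
pher = [[1.0] * 4 for _ in range(4)]
for _ in range(30):
    best = generation(pher, n_ants=8, cost=cost, rng=rng)
print(best)
```

As in the Erlang version, only the "master" step writes to the pheromone matrix, while solution construction only reads it; this read-mostly access pattern is what makes the shared ETS table a good fit.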

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of the network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO (the master process, NC colony nodes, and NA ant processes per colony node)

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placement of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (the level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying


Figure 17: Node Placement in Multi-Level Distributed ACO (level 0: the master process; levels 1 to N−1: sub-master nodes; level N: colony nodes only)


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 of the 150 nodes can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO, nodes are only connected to the nodes in their own s_group.
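The ML-ACO tree-sizing relation given above can be checked mechanically (a Python sketch of the arithmetic only, not of the Erlang placement code; the function name is illustrative):

```python
def ml_aco_tree(p, n):
    """Return (levels, usable_nodes): the maximum X such that
    1 + P + P^2 + ... + P^(X-2) + P^X <= N, together with the number of
    nodes such a tree actually occupies."""
    def nodes(x):
        # master node, sub-master levels 1..X-2, plus P^X colony nodes
        return 1 + sum(p ** i for i in range(1, x - 1)) + p ** x
    x = 1
    while nodes(x + 1) <= n:
        x += 1
    return x, nodes(x)

print(ml_aco_tree(5, 150))  # → (3, 131)
```

For P = 5 and N = 150 this reproduces the example above: a 3-level tree using 131 of the 150 available nodes (1 + 5 + 5^3 = 131).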

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), the solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) the overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution, fail to ever arrive at the global optimum, and never terminate.


Figure 19: Mean Error (y-axis: mean error (%); x-axis: number of colonies, 1–256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run your program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions; it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare the incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Execution time (y-axis: mean execution time (s); x-axis: number of colonies, 1–256)

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact, this made little difference to the execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and that the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes. As explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO, and GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO, and GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO, and SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see Section 3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from Section 3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see Section 3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate phenomena similar to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


[Plot omitted: execution time (s) against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 24: TL-ACO execution times, Athos cluster

[Plot omitted: execution time (s) against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 25: ML-ACO execution times, Athos cluster


[Plot omitted: execution time (s) against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 26: GR-ACO execution times, Athos cluster

[Plot omitted: execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

Figure 27: R15B execution times, messages ×500


[Plot omitted: execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

Figure 28: OTP 17.4 execution times, messages ×500

[Plot omitted: execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500


[Plot omitted: execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and this means that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


[Plot omitted: execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

Figure 31: OTP 17.4 execution times (2), Athos cluster

[Plot omitted: execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
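To make the degree of fragmentation concrete, the sketch below counts the contiguous blocks in a SLURM hostlist expression. The module is ours and purely illustrative, not part of Sim-Diasca or SLURM; SLURM itself can expand such expressions with `scontrol show hostnames`.

```erlang
-module(slurm_frag).
-export([fragments/1]).

%% Count the contiguous blocks in a SLURM hostlist expression such as
%% "atcn[141,144,181-184]": each comma-separated item is one block.
%% More blocks means a more fragmented, more spread-out allocation.
-spec fragments(string()) -> non_neg_integer().
fragments(Expr) ->
    {match, [Inside]} = re:run(Expr, "\\[(.*)\\]", [{capture, [1], list}]),
    length(string:tokens(Inside, ",")).
```

Applied to the two allocations shown above, this gives 23 contiguous blocks for the busy-period allocation against 12 for the quiet-period one, quantifying how much more scattered the former is.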

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


[Plots omitted: (a) number of sent packets; (b) number of received packets.]

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver and are thus more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
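The speedup and efficiency figures quoted here are simple ratios; as a minimal sketch (the module name is ours, and the runtimes are illustrative values read off Figures 34 and 35):

```erlang
-module(speedup).
-export([relative/2, efficiency/3]).

%% Speedup of an N-node run relative to the single-node run.
relative(T1, Tn) -> T1 / Tn.

%% Fraction of the ideal N-fold speedup actually achieved.
efficiency(T1, Tn, N) -> relative(T1, Tn) / N.
```

With roughly 1000 minutes on one node and 290 on sixteen, relative(1000, 290) is about 3.4, and parallel efficiency falls from about 0.75 on 2 nodes to about 0.22 on 16 nodes, which is the inefficiency described above.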

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
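Concretely, the two settings discussed above correspond to standard Erlang VM command-line flags; a sketch of the resulting invocation (the 12:12 value reflects the 12 physical cores of the Athos hosts discussed here):

```shell
# Bind schedulers with the thread_no_node_processor_spread policy, and
# start 12 schedulers (12 online), ignoring hyperthreaded logical cores.
erl +sbt tnnps +S 12:12
```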

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 logical cores available) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets, between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite developed by RELEASE to measure the scalability of applications written in Erlang. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

What we did, therefore, was to move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.
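For reference, a BenchErl benchmark is a module exporting `bench_args/2` and `run/3`. The hedged skeleton below shows the shape of such a wrapper; the module name and argument sets are ours, and the actual Sim-Diasca invocation inside `run/3` is elided:

```erlang
-module(sim_diasca_bench).
-export([bench_args/2, run/3]).

%% Each element of the list returned by bench_args/2 is one argument
%% set for a benchmark run; Version selects how long the run should be.
bench_args(Version, _Conf) ->
    case Version of
        short        -> [[tiny,  brief]];
        intermediate -> [[small, brief]];
        long         -> [[small, short]]
    end.

%% Slaves is the list of computing nodes that BenchErl has already
%% launched; after our changes the engine neither launches nor shuts
%% down nodes itself.  The real simulation call is elided here.
run([Scale, Duration], Slaves, _Conf) ->
    io:format("running ~p/~p on ~p~n", [Scale, Duration, Slaves]),
    ok.
```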

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was unrelated to the actual simulation, pertaining instead, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
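Such a plugin essentially amounts to an RPC broadcast at the simulation start and stop notifications. The sketch below is illustrative only: the callback names and the Percept2 option list are our assumptions (the exact Sim-Diasca plugin callbacks and Percept2 options are described in the Sim-Diasca documentation and in D5.2 respectively), while `percept2:profile/2` and `percept2:stop_profile/0` are the Percept2 entry points used.

```erlang
-module(percept2_plugin).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Start Percept2 on every computing node, writing one trace file per
%% node; the files can later be analysed with percept2:analyze/1.
on_simulation_start(ComputingNodes) ->
    [rpc:call(Node, percept2, profile,
              ["percept2_" ++ atom_to_list(Node) ++ ".dat",
               [message, process_scheduling]])  % option list illustrative
     || Node <- ComputingNodes],
    ok.

%% Stop Percept2 on all computing nodes as soon as the simulation ends.
on_simulation_stop(ComputingNodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.
```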

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


[Plots omitted: (a) execution time; (b) speedup.]

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


[Plots omitted: (a) execution time; (b) speedup.]

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB) and that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences: i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes, and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API, and WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after it, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, so that they do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
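A hedged sketch of what these entries might look like in a simulation settings file follows; only the two option names come from the text, while the file format, the boolean value and the cookie value are illustrative assumptions:

```erlang
%% The computing nodes are already deployed (by WombatOAM), so the
%% deployment manager must not try to launch them itself:
{start_nodes, false}.

%% Use this pre-agreed cookie rather than generating a random one; it
%% must match the cookie of all the already-running computing nodes:
{use_cookies, 'sim_diasca_cookie'}.
```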

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned, but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of future investigation not to copy the Erlang release to all instances, but to expect the images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the host name or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which a node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
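The connectivity saving that such a partitioning aims at can be illustrated with a back-of-the-envelope count of pairwise node connections (an illustrative Python sketch; it counts only intra-group links and ignores the gateway links between s_groups):

```python
def full_mesh_connections(n):
    # Distributed Erlang default: every node connects to every other node.
    return n * (n - 1) // 2

def sgroup_connections(n, group_size):
    # Nodes partitioned into disjoint groups of `group_size`; connections
    # exist only within each group (gateway links between groups ignored).
    groups, rest = divmod(n, group_size)
    return groups * full_mesh_connections(group_size) + full_mesh_connections(rest)

print(full_mesh_connections(256))   # 32640 connections in a 256-node full mesh
print(sgroup_connections(256, 16))  # 1920 connections with 16-node s_groups
```

Even this crude count shows an order-of-magnitude reduction in connections, which is the motivation for partitioning the time manager tree.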


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by taking measurements over only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and identified design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user access the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's TCP/IP-based distribution mechanism, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes to which we explicitly send messages, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a basename (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
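The name construction performed by startup can be sketched as follows (an illustrative Python sketch; the `@` separator and exact concatenation are assumptions based on the description above, not the module's actual code):

```python
def build_node_name(mpi_index, hostname, basename="mpinode"):
    # basename ++ MPI index ++ hostname, as passed to net_kernel for
    # initialization (the "@" separator is an assumption).
    return "%s%d@%s" % (basename, mpi_index, hostname)

print(build_node_name(0, "cn001"))  # mpinode0@cn001
```

Because the MPI index is unique across the job, every node derives a distinct name even though all nodes share the same base name.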

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program, written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on its first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
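The command-byte parsing performed by output can be mimicked in a few lines (an illustrative Python sketch; the actual driver is C code, and the command-byte values below are invented for the example, not taken from the driver):

```python
# Hypothetical command-byte values; the real C driver defines its own.
COMMANDS = {
    0: "listen",
    1: "accept",
    2: "connect",
    3: "send",
    4: "receive",
}

def dispatch(buf):
    """Parse one command byte and return (command, payload),
    mirroring how `output` relays the remainder to the handler."""
    cmd, payload = buf[0], buf[1:]
    return COMMANDS[cmd], payload

print(dispatch(bytes([3]) + b"hello"))  # ('send', b'hello')
```

The same byte-prefix convention also covers the control commands, with a separate command table for control mode.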

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
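The two series of ant counts can be generated as follows (a small Python sketch, assuming a constant step of 10, respectively 500, after the first point):

```python
# Experiment series: one Erlang process per ant.
small = [1] + list(range(10, 1001, 10))      # 1, 10, 20, 30, ..., 1000
large = [1] + list(range(500, 100001, 500))  # 1, 500, 1000, 1500, ..., 100000

print(len(small), small[:4], small[-1])  # 101 [1, 10, 20, 30] 1000
print(len(large), large[:4], large[-1])  # 201 [1, 500, 1000, 1500] 100000
```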

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 2: Vehicle speed based on the load of a road

Figure 3: Main classes and models of interest for the waste management system


Figure 4: Phases of a few weather cells, recreating Lorenz's strange attractor

once converted into a lower actual concurrency, was insufficient. The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot; these traits could not be easily changed.

A new dimension thus had to be added to this simulation case.

The weather system To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure, and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, solves these differential equations numerically using a fourth-order Runge-Kutta method. It is additionally perturbed by its neighbours, as adjacent cells influence each other.
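The per-cell numerical scheme can be sketched as follows (an illustrative Python sketch: the classic Lorenz parameters and the step size are assumptions, and the neighbour coupling is omitted):

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (classic parameter values assumed)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """One classic fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + dt / 2 * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + dt / 2 * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# Each cell starts from different initial conditions but obeys the same equations.
state = (1.0, 1.0, 1.0)
for _ in range(100):
    state = rk4_step(lorenz, state, 0.01)
```

Iterating this step per simulation tick is what gives each cell its homogeneous, purely CPU-bound workload.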

Various cell trajectories in the phase space are shown in Figure 4. These models have been very useful for tuning the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5: Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and that can moreover be finely and easily tuned in terms of respective resource consumption.

Should we try to figure out the actual resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) memory.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will most probably start by being CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then become increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second processing), but will progressively scatter the interacting instances across the hosts6, thus increasingly replacing local communications with networked ones and slowing down the

6Even with a smart load balancer, the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation. As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking activities, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration increased very quickly as the scale grew, notably because of the embedded mini-GIS7, which operated sequentially and whose load grew exponentially with the number of spatialised instances to manage8.

Efforts were made to remove that GIS bottleneck and to make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before the evaluation of the simulation itself could even start).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.
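The report does not specify the actual file format, so the following is only a hypothetical sketch of how a loader can cope with cyclic references: a first pass creates every instance recording reference names only, and a second pass resolves names to objects. Since the second pass treats entries independently, it lends itself to parallel processing.

```python
# Hypothetical two-pass loader: the actual Sim-Diasca initialisation format
# is not described here, so entries are modelled as (id, class, ref names).
entries = [
    ("road_1", "Road", ["junction_a", "junction_b"]),
    ("junction_a", "Junction", ["road_1"]),      # cyclic reference
    ("junction_b", "Junction", ["road_1"]),
]

class Instance:
    def __init__(self, ident, cls):
        self.ident, self.cls, self.refs = ident, cls, []

# Pass 1: create every instance, keeping only reference *names* for now.
instances = {ident: Instance(ident, cls) for ident, cls, _ in entries}

# Pass 2: resolve names to actual objects; cycles are harmless because all
# instances already exist. Each entry is independent of the others, so this
# pass could be distributed over many worker processes.
for ident, _, ref_names in entries:
    instances[ident].refs = [instances[name] for name in ref_names]

assert instances["road_1"].refs[0].refs[0] is instances["road_1"]
```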

This newer scheme allowed the actual simulations to bypass the heavy, sequential GIS computations, since their precomputed result could be read directly from a pre-established file. Even though the pre-simulation phases were thus shortened, the creation of the initial instances itself remained a demanding operation, even if it was made largely parallel.

The last changes that were made dealt with the integration of third-party tools with Sim-Diasca, such as BenchErl and Percept2.

Distributed applications like Sim-Diasca of course have their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment on its own instead; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

If an ad hoc solution for the BenchErl integration could finally be devised, the deployment not only remained a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed as well: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which were the elected nodes and notify it when each simulation phase began or finished (e.g. monitoring the

7GIS stands for Geographic Information System.
8The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances; otherwise the shorter roads would lead to traffic durations so brief as to induce, when quantised over the simulation time-step, a relative error above the default threshold allowed by the engine. Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading could not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables the programmer to control locality and reduce connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model, nodes are grouped into a number of s_groups: nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global namespace; instead, every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project; it can be found at https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy.

The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that it appears that the node names do not correspond exactly to the physical structure of the cluster: see [REL15, §4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0,X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0,X]. This creates new numbers (x1, ..., xn) ∈ [0,X]. The generator functions are applied


Figure 6 SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14], which uses replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).
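To make the computation concrete, the following is a sequential sketch of the Orbit closure: apply the generators to a growing frontier of numbers until no new number appears (the RELEASE benchmark is written in Erlang and distributes the `seen` set over a DHT; the generator functions here are purely illustrative).

```python
# Sequential Orbit sketch; in the benchmark the 'seen' set is a DHT
# partitioned across worker processes.
X = 100  # size of the space [0, X)

# Illustrative generators mapping the space into itself.
generators = [lambda x: (2 * x) % X, lambda x: (x * x + 1) % X]

def orbit(x0):
    seen = {x0}          # plays the role of the distributed hash table
    frontier = [x0]
    while frontier:      # apply generators until no new number is generated
        nxt = []
        for x in frontier:
            for g in generators:
                y = g(x)
                if y not in seen:
                    seen.add(y)
                    nxt.append(y)
        frontier = nxt
    return seen

result = orbit(1)
```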

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.
$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.
$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and mainly serves to check whether the script works. Here, N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and qos=release is the RELEASE project quota that enables us to request up to 256 Athos hosts.

To run the experiments we need to define the following parameters in the run-slurm script (Figure 7):


FROMNUMNODES is the minimum number of nodes, on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment is run.

Figure 7 Parameters in run-slurm

Figure 8 Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters as follows: $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
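The stepping logic can be sketched as follows (a hypothetical reconstruction: the real logic lives in the run-slurm shell script; variable names follow the report, and `requested_nodes` stands for the -N value passed to SLURM):

```python
# Hypothetical sketch of how run-slurm's parameters determine the runs.
FROMNUMNODES = 4
STEPNODES = 3
NUMREPEAT = 2
requested_nodes = 10   # assumption: the -N value of the SLURM allocation

# Each node count from FROMNUMNODES up to the allocation, stepped by
# STEPNODES, is run NUMREPEAT times.
runs = [(nodes, rep)
        for nodes in range(FROMNUMNODES, requested_nodes + 1, STEPNODES)
        for rep in range(1, NUMREPEAT + 1)]
# -> experiments on 4, 7 and 10 nodes, each repeated twice
```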

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; then we run the experiment using the timetest script, and stop the VMs. We also tried running the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments that reuse the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module, in the call to the bench_dist4 function.

3.1.2 Distributed Erlang Orbit

In distributed Erlang Orbit, all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other, and report results to the master node. Each worker process owns part of a distributed hash table; a hash function is applied to a generated number to determine in which part of the hash table the number should be stored.
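The partitioning idea can be illustrated as follows (a sketch only: the real benchmark hashes within the Erlang VM, and the hash function here is illustrative):

```python
# Sketch of DHT ownership in D-Orbit: every worker agrees on the same
# mapping, so a newly generated number can be sent directly to the single
# worker responsible for storing it.
NUM_WORKERS = 4

def owner(x):
    """Map a generated number to the worker owning its DHT fragment."""
    return hash(x) % NUM_WORKERS

tables = {w: set() for w in range(NUM_WORKERS)}
for x in range(1000):
    tables[owner(x)].add(x)
```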

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially, the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
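The invariant behind credit-based termination can be shown with a toy example (an assumption-laden sketch: the [MC98] scheme uses fractional credits and message-borne credit transfer; integers keep this example exact):

```python
# Toy sketch of credit-based termination detection: credit is split when
# work is handed out and returned when a process goes passive, so the
# master sees the full amount back exactly when all work has finished.
TOTAL = 1 << 20          # master's initial credit

def split(credit):
    """An active process passes half its credit along with spawned work."""
    return credit // 2, credit - credit // 2

master_collected = 0
credit = TOTAL
for _ in range(5):                 # five processes become active in turn
    kept, credit = split(credit)
    master_collected += kept       # each returns its share when passive
master_collected += credit         # the last process goes passive too
```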

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code


Figure 9 D-Orbit Performance Depending on the Number of Worker Processes


Parameters. In the experiments we discuss in Section 3.1.4, we use the following parameters:

• The Orbit generator is benchg123451.

• We run experiments for the following initial Orbit space sizes: 2*10^6, 3*10^6, 4*10^6, and 5*10^6 elements.

To identify the optimal number of worker processes per worker node, we ran a set of experiments on a single node, with the Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into a set of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is thus equal to the number of other worker nodes in its worker s_group.

ICT-287510 (RELEASE) 23rd December 2015 17

Figure 10 Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections, and a sub-master node has (M − 1 + (N − 1)/M) connections.
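Using the connection counts above, a quick calculation illustrates the savings (the cluster sizes below are illustrative, not taken from the experiments):

```python
def d_orbit_connections(n):
    """Fully connected distributed Erlang: every node links to all others."""
    return n - 1

def sd_orbit_worker_connections(m):
    """Worker node inside an s_group of M nodes."""
    return m - 1

def sd_orbit_submaster_connections(n, m):
    """Sub-master: (M - 1) workers plus roughly (N - 1)/M sub-masters."""
    return (m - 1) + (n - 1) // m

# Example: 440 nodes split into s_groups of 11 nodes each.
n, m = 440, 11
print(d_orbit_connections(n))              # 439 connections per node
print(sd_orbit_worker_connections(m))      # 10 connections per worker
print(sd_orbit_submaster_connections(n, m))  # 49 connections per sub-master
```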

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

• Sub-master nodes are placed on Athos hosts separate from those of the worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that in this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


| No | Name | Location | Hosts | Cores per host | Total cores | Max cores | Wait time | Processor | RAM | Distributed Erlang Port |
|----|----------|----------|-------|----------------|-------------|-----------|-----------|--------------------------|------|-------------------------|
| 1 | GPG | GLA | 20 | 16 | 320 | 320 | 0 | Xeon E5-2640 v2, 2GHz | | Yes |
| 2 | TinTin | Uppsala | 160 | 16 | 2560 | - | | | | Yes |
| 3 | Kalkyl | Uppsala | | 8 | | | varies | | | Yes |
| 4 | Athos | EDF | 776 | 24 | 18624 | 6144 | varies | Xeon E5-2697 v2, 2.7GHz | 64GB | Yes |
| 5 | Zumbrota | EDF | 4096 | 16 | 65536 | | 17hrs | Blue Gene/Q (PowerPC A2) | | No |

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows, SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows further.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak, the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and, unlike D-Orbit, its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, after which human intervention is required to restart them. Due to the way SLURM works, a user is not informed of the reasons for such failures immediately; so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This holds for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11 D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-174


(a) Runtime

(b) Speedup

Figure 12 D-Orbit Performance in SD ErlangOTP 174


(a) Runtime

(b) Speedup

Figure 13 SD-Orbit Performance in SD ErlangOTP 174


(a) Runtime

(b) Speedup

Figure 14 D-Orbit and SD-Orbit Performance in SD ErlangOTP 174


(a) Runtime

(b) Speedup

Figure 15 D-Orbit and SD-Orbit Performance on Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found here: https://github.com/release-project/benchmarks/tree/master/ACO

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants which independently construct solutions to the input problem. The ants do this using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i,j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while the other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.
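The generation loop just described can be sketched for a tiny instance (a toy illustration only: the actual implementation is in Erlang, and the cost function, parameters, and update rule below are simplified assumptions, e.g. unit job weights and a plain evaporate-and-reinforce pheromone update):

```python
import random

random.seed(0)
N = 5                                 # jobs to place in N positions
lengths = [3, 1, 4, 1, 5]
due = [4, 2, 8, 3, 10]
P = [[1.0] * N for _ in range(N)]     # pheromone: desirability of job i at position j

def cost(schedule):
    """Total tardiness of a schedule (unit weights, for simplicity)."""
    t, total = 0, 0
    for job in schedule:
        t += lengths[job]
        total += max(0, t - due[job])
    return total

def construct():
    """One ant builds a schedule position by position, biased by P."""
    remaining = list(range(N))
    schedule = []
    for pos in range(N):
        weights = [P[job][pos] for job in remaining]
        job = random.choices(remaining, weights=weights)[0]
        remaining.remove(job)
        schedule.append(job)
    return schedule

best = None
for generation in range(30):
    ants = [construct() for _ in range(10)]
    champion = min(ants, key=cost)
    if best is None or cost(champion) < cost(best):
        best = champion
    # Reinforce the best solution found so far, evaporate everything else.
    for pos in range(N):
        for job in range(N):
            P[job][pos] *= 0.9
        P[best[pos]][pos] += 1.0
```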

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of the network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO (a master process connected to Nc colony nodes, each hosting NA ant processes)

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N, and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the numbers of processes, nodes, and levels. If the number of processes on each node is P, and the number of all available nodes is N, then the number of levels X is the maximum X in the following:


Figure 17: Node Placement in Multi-Level Distributed ACO (a master process at level 0, sub-master nodes at levels 1 to N−1, and colony nodes at level N)


Figure 18 Process Placement in Multi Level ACO


1 + P + P^2 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 = 131 ≤ 150), and only 131 nodes out of 150 can be used.
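The node-budget computation can be sketched as follows (function names are illustrative; the inequality is the one given above):

```python
def nodes_used(p, x):
    """Nodes consumed by an X-level tree: sub-master levels 0..X-2 hold
    p**k nodes each, and the last level holds p**x colony nodes."""
    return sum(p ** k for k in range(x - 1)) + p ** x

def max_levels(p, n):
    """Largest X such that nodes_used(p, X) <= n."""
    x = 1
    while nodes_used(p, x + 1) <= n:
        x += 1
    return x

print(max_levels(5, 150))   # 3 levels
print(nodes_used(5, 3))     # 131 of the 150 available nodes are used
```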

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size, as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time, as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) the overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution, and fail to ever arrive at the global optimum and terminate.


Figure 19: Mean Error (%) against the number of colonies (1 to 256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
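The quality metric used here (the mean difference in cost from the known optima, expressed as a percentage) can be sketched as follows. The instance costs in the example are invented for illustration; they are not the ORLIB results plotted in Figure 19.

```python
# Sketch of the quality metric described above: the mean percentage
# difference between the costs found by the solver and the known optima.
# The cost values below are hypothetical, not the reported results.

def mean_error(found_costs, optimal_costs):
    """Mean relative error (%) of found solution costs vs. known optima."""
    assert len(found_costs) == len(optimal_costs)
    errors = [100.0 * (f - o) / o for f, o in zip(found_costs, optimal_costs)]
    return sum(errors) / len(errors)

# Example: three SMTWTP instances with known optimal costs.
optima = [1400.0, 2000.0, 900.0]
found = [1470.0, 2000.0, 936.0]   # hypothetical solver output
print(mean_error(found, optima))  # mean of the per-instance errors 5%, 0% and 4%
```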

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Execution time (x-axis: number of colonies, 1 to 256; y-axis: mean execution time (s))

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
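The reproducibility trick described above, replacing the random number generator with a function that returns a cyclic sequence, can be sketched as follows. This is a minimal illustration in Python, not the project's actual Erlang code.

```python
# Minimal sketch of the reproducibility trick described above: replace the
# random number generator with one that cycles through a fixed sequence,
# so repeated runs with the same input behave identically.
from itertools import cycle

class CyclicRandom:
    """Deterministic stand-in for a uniform random generator on [0, 1)."""
    def __init__(self, sequence):
        self._next = cycle(sequence).__next__

    def random(self):
        return self._next()

rng = CyclicRandom([0.1, 0.5, 0.9])
draws = [rng.random() for _ in range(5)]
print(draws)  # [0.1, 0.5, 0.9, 0.1, 0.5]
```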

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21 to 23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems; TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24 to 26 show how the performance of each ACO version varies depending on the Erlang version, and, as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.
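A back-of-envelope calculation illustrates why reducing the number of connections matters: in the default distribution mechanism every pair of nodes is connected, giving n(n-1)/2 connections, whereas partitioning nodes into s_groups keeps full connectivity only within each group. The grouping topology sketched below (one gateway node per group, gateways fully meshed) is an illustrative assumption, not the actual SR-ACO design.

```python
def full_mesh_connections(n):
    """Connections when every node links to every other node
    (the default distributed Erlang behaviour)."""
    return n * (n - 1) // 2

def grouped_connections(n, group_size):
    """Rough model: a full mesh inside each s_group, plus a full mesh of
    one gateway node per group. This topology is an illustration only."""
    groups = -(-n // group_size)  # ceiling division
    intra = groups * full_mesh_connections(group_size)
    inter = full_mesh_connections(groups)
    return intra + inter

print(full_mesh_connections(256))    # 32640 connections in a 256-node mesh
print(grouped_connections(256, 16))  # 16 groups: 16*120 + 120 = 2040
```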

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27 to 29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.
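The single-master bottleneck argument can be made concrete with a small sketch: in TL-ACO every colony reports to one master process each iteration, whereas in ML-ACO the submaster hierarchy bounds the number of reports any single process receives. The fanout value below is illustrative, not a figure taken from the deliverable.

```python
def messages_at_master_tl(colonies):
    """TL-ACO: every colony reports directly to one master each iteration,
    so the master receives one message per colony."""
    return colonies

def max_messages_per_node_ml(colonies, fanout):
    """ML-ACO: results are combined up a tree of submasters, so no single
    process receives more than `fanout` reports per iteration
    (`fanout` is an illustrative parameter)."""
    return min(colonies, fanout)

print(messages_at_master_tl(256))        # 256 messages hit one process
print(max_messages_per_node_ml(256, 8))  # at most 8 per submaster
```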

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21 to 23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 25: ML-ACO execution times, Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))


Figure 26: GR-ACO execution times, Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 27: R15B execution times, messages ×500 (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO)


Figure 28: OTP 17.4 execution times, messages ×500 (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500 (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO, SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30 to 32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (x-axis: number of nodes, 0 to 250; y-axis: execution time (s); series: TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
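The fragmentation of an allocation such as those shown above can be quantified by expanding the compressed SLURM node list. The simplified parser below illustrates the idea (real SLURM provides `scontrol show hostnames` for this purpose), applied to a small subset of the first allocation.

```python
import re

def expand_nodelist(spec):
    """Expand a compressed SLURM node list like 'atcn[141,144,181-184]'
    into individual host names. A simplified parser for illustration;
    real SLURM offers 'scontrol show hostnames' for this."""
    prefix, body = re.fullmatch(r"(\w+)\[(.*)\]", spec).groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        for i in range(int(lo), int(hi or lo) + 1):
            hosts.append(f"{prefix}{i:0{len(lo)}d}")  # keep zero-padding
    return hosts

# A small subset of the first allocation shown above, for illustration.
hosts = expand_nodelist("atcn[141,144,181-184,189-198]")
print(len(hosts))  # 2 + 4 + 10 = 16 hosts, in 4 contiguous fragments
```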

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.
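The per-interface packet counters behind Figure 33 are the standard Linux counters that netstat reports; on Linux they can also be read from /proc/net/dev. A sketch of parsing that format follows; the sample counter values are fabricated for illustration.

```python
# Sketch: read per-interface packet counters from /proc/net/dev text
# (the same counters netstat reports). SAMPLE is fabricated data.

SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000000  9000    0    0    0     0          0         0  2000000  7500    0    0    0     0       0          0
"""

def packet_counts(proc_net_dev_text):
    """Return {interface: (rx_packets, tx_packets)} from /proc/net/dev text."""
    counts = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip the two header lines
        iface, _, rest = line.partition(":")
        fields = rest.split()
        # field 1 is received packets, field 9 is transmitted packets
        counts[iface.strip()] = (int(fields[1]), int(fields[9]))
    return counts

print(packet_counts(SAMPLE))  # {'eth0': (9000, 7500)}
```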

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO ((a) Number of Sent Packets; (b) Number of Received Packets)


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage, and Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this rises to only 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
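The speedup and efficiency figures quoted above can be reproduced with a short calculation. The runtimes below are hypothetical values shaped to match the behaviour reported in Figures 34 and 35 (a roughly 1000-minute single-node run reaching a maximum relative speedup of 3.45); they are not the measured data.

```python
def speedup_and_efficiency(base_time, times_by_nodes):
    """Relative speedup and parallel efficiency with respect to the
    single-node (16-core) runtime."""
    out = {}
    for nodes, t in times_by_nodes.items():
        s = base_time / t
        out[nodes] = (round(s, 2), round(s / nodes, 2))
    return out

# Hypothetical runtimes (minutes), shaped like Figures 34-35.
times = {1: 1000.0, 2: 667.0, 4: 455.0, 16: 290.0}
for nodes, (s, eff) in speedup_and_efficiency(times[1], times).items():
    print(nodes, s, eff)  # efficiency falls steadily as nodes are added
```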

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread binding policy, proved to be effective. In this mode schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at this scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, as the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability

• information about messages sent and received


Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief' ((a) Execution time; (b) Speedup)


Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief' ((a) Execution time; (b) Speedup)


• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
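The timed profiling window used by the modified plugin (wait 10 seconds after the simulation starts, profile for 5 seconds, then stop) follows a generic pattern that can be sketched as follows. This is an illustration in Python, not the actual Erlang plugin.

```python
import threading
import time

def profile_window(start_profiling, stop_profiling, delay_s, duration_s):
    """Call start_profiling after `delay_s` seconds and stop_profiling
    `duration_s` seconds later, without blocking the caller. This mirrors
    the shape of the plugin described above (wait 10 s, profile for 5 s)."""
    def worker():
        time.sleep(delay_s)
        start_profiling()
        time.sleep(duration_s)
        stop_profiling()
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

events = []
t = profile_window(lambda: events.append("start"),
                   lambda: events.append("stop"),
                   delay_s=0.01, duration_s=0.02)  # scaled down for the demo
t.join()
print(events)  # ['start', 'stop']
```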

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under the LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involved terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that

ICT-287510 (RELEASE) 23rd December 2015 52

already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the nodes; instead, they should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have certain names, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming scheme: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 by the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.
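The naming rule can be sketched as a small pure helper. This is our own illustrative sketch, not Sim-Diasca's actual code; the capitalisation and separators are assumptions based on the soda_benchmarking_test example above.

```erlang
-module(node_naming).
-export([computing_node_name/3]).

%% Build the node name Sim-Diasca expects for a computing node from the
%% simulation name, the user name and the host (illustrative sketch).
computing_node_name(SimulationName, User, Host) ->
    %% "soda_benchmarking_test" -> "Soda_Benchmarking_Test"
    Camel = string:join(
              [capitalise(W) || W <- string:tokens(SimulationName, "_")],
              "_"),
    lists:flatten(["Sim-Diasca_", Camel, "-", User, "@", Host]).

%% Upper-case the first character of a word.
capitalise([C | Rest]) -> [string:to_upper(C) | Rest].
```

For example, computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1") yields the node name shown above.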

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the host name or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which a node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a configuration file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
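The membership rule of the proposed design can be made concrete with a small pure function. This is our own sketch, not an implementation of the design; the tree representation (a map from a time manager to the list of its children) is an assumption.

```erlang
-module(tm_sgroups).
-export([sgroups_of/2]).

%% Each manager heads one s_group containing itself and its children; a
%% non-root manager therefore belongs to two s_groups: its parent's group
%% and its own.  Returns the list of {GroupHead, Members} pairs that the
%% given time manager TM belongs to.
sgroups_of(TM, Tree) ->
    ParentGroups = [ {P, [P | Cs]}
                     || {P, Cs} <- maps:to_list(Tree),
                        lists:member(TM, Cs) ],
    ParentGroups ++ [{TM, [TM | maps:get(TM, Tree, [])]}].
```

For the default tree of height one, every non-root manager would thus share one s_group with the root and its siblings, and head a (possibly singleton) s_group of its own.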


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, instead one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
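The command-byte dispatch performed by output (and, analogously, control) can be illustrated in Erlang. This is a sketch with made-up command bytes; the real driver parses these bytes in C, and its actual encoding is not documented here.

```erlang
-module(mpi_cmd).
-export([dispatch/1]).

%% Parse a leading command byte and return the command together with the
%% remaining payload.  The command bytes ($l, $a, ...) are illustrative,
%% not the driver's actual encoding.
dispatch(<<$l, Rest/binary>>) -> {listen,  Rest};
dispatch(<<$a, Rest/binary>>) -> {accept,  Rest};
dispatch(<<$c, Rest/binary>>) -> {connect, Rest};
dispatch(<<$s, Rest/binary>>) -> {send,    Rest};
dispatch(<<$r, Rest/binary>>) -> {recv,    Rest}.
```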

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
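The measurement procedure (5 runs per configuration, reporting the mean) can be sketched as follows. This is our own harness sketch, not the project's actual benchmarking code.

```erlang
-module(bench).
-export([mean_runtime/2]).

%% Run Fun() N times and return the mean wall-clock time in seconds.
mean_runtime(Fun, N) when N > 0 ->
    Micros = [element(1, timer:tc(Fun)) || _ <- lists:seq(1, N)],
    lists:sum(Micros) / (N * 1000000).
```

One data point per ant count would then be obtained with a call such as mean_runtime(fun() -> ant_colony:run(40, 50, Ants) end, 5), where ant_colony:run/3 stands in for the (hypothetical) entry point of the ACO application.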


Figure 47: EDF Xeon machines, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for the same five Erlang/OTP versions.


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


Figure 50: Glasgow Xeon machines, small executions. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for the same five Erlang/OTP versions.

Figure 51: Glasgow Xeon machines, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which means that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for the same five Erlang/OTP versions.


Figure 54: Heriot-Watt AMD machine, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.

Figure 54 Heriot-Watt AMD machine large executions

Change Log

Version   Date         Comments
0.1       31/01/2015   First version, submitted to internal reviewers
0.2       23/03/2015   Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0       27/03/2015   Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

• Executive Summary
• The main case study
  • Sim-Diasca Overview
  • City Example
    • Overview of the simulation case
    • Description of the simulated elements
    • Additional changes done for benchmarking
• Benchmarks
  • Orbit
    • Running Orbit on Athos
    • Distributed Erlang Orbit
    • SD Erlang Orbit
    • Experimental Evaluation
    • Results on Other Architectures
  • Ant Colony Optimisation (ACO)
    • ACO and SMTWTP
    • Multi-colony approaches
    • Evaluating Scalability
    • Experimental Evaluation
      • Performance comparison of different ACO and Erlang versions on the Athos cluster
        • Basic results
        • Increasing the number of messages
        • Some problematic results
        • Network Traffic
    • Summary
• Measurements
  • Distributed Scalability
    • Performance
    • Distributed Performance Analysis
    • Discussion
  • BenchErl
  • Percept2
    • Experiments
• Deploying Sim-Diasca with WombatOAM
  • The design of the implemented solution
  • Deployment steps
• SD Erlang Integration
  • Implications and Future Work
• Porting Erlang/OTP to the Blue Gene/Q
  • Basing Erlang/OTP's Distribution Mechanism on MPI
  • MPI Driver Internals
  • Current Status of the Blue Gene/Q Port
• Single-machine ACO performance on various architectures and Erlang/OTP releases
  • Experimental parameters
  • Discussion of results
    • EDF Xeon machines
    • Glasgow Xeon machines
    • AMD machines
  • Discussion


Figure 4: Phases of a few weather cells recreating Lorenz's strange attractor

once converted into a lower actual concurrency, was insufficient. The overall scale of the case was thus increased in order to alleviate this problem, but this had a still worse impact on the memory and network capabilities, whose limits were then reached first; as a result, obtaining a high processing load was not easily achievable in that setting.

The root of the problem lay in the waste-related models, which are less CPU-bound than memory-bound or network-bound: applying their behavioural rules does not require much processing, while the model instances maintain fairly complex states and communicate a lot, and these traits could not easily be changed.

A new dimension thus had to be added to this simulation case.

The weather system. To ensure that the City-example case became more CPU-bound, we introduced a new domain of interest: the weather above the city, modelled in a very simplistic way.

A regular grid of weather cell models has been added. Each of these cells manages a few local physical quantities (like temperature, pressure, and hydrometry). They all start with different initial conditions, yet are ruled by the same set of Lorenz equations.

Each cell, based on its state, numerically solves these differential equations thanks to a fourth-order Runge-Kutta method. It is additionally unsettled by its neighbours, as adjacent cells influence each other.
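As a concrete illustration of such a cell update, a minimal Python sketch (not the engine's actual Erlang code; the textbook Lorenz parameters and the step size are assumptions made here for illustration) could look like:

```python
# Lorenz derivatives for one weather cell's state (x, y, z),
# with the classical textbook parameter values.
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

# One classical fourth-order Runge-Kutta step of size h.
def rk4_step(f, state, h):
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + h * k for s, k in zip(state, k3)))
    return tuple(s + h / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# One cell stepping through simulated time from its initial conditions.
state = (1.0, 1.0, 1.0)
for _ in range(1000):
    state = rk4_step(lorenz, state, 0.01)
```

In the actual engine each cell would additionally exchange actor messages with its neighbours between steps; that coupling is omitted here.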

Various cell trajectories in the phase space are shown in Figure 4. These models have been very useful in order to tune the level of resources demanded by the City-example case: we can select a grid of weather cells as fine as needed, hence increasing their number and the computing load they induce.

choices among routes of equal interest, resulting in different simulation outcomes. The problem was discovered relatively late, as the engine probes had long been deactivated so as not to hinder scalability.


Figure 5: Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. As, moreover, each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control the processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, overall mixing two modelling paradigms (algorithmic and equation-driven, respectively for the waste and the weather domains), able to adopt approximately any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and which can moreover be finely and easily tuned in terms of respective resource consumption.

Should we have to figure out the resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will most probably start by being CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then be increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second processing), but will progressively lead to scattering the interacting instances more and more across the hosts6, thus increasingly replacing local communications with networked ones and slowing down the

6Even with a smart load balancer the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation.

As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration increased very quickly as the scale grew, notably because of the embedded mini-GIS7, which was operating sequentially and whose load was growing exponentially with the number of spatialised instances to manage8.

Efforts were made to remove that GIS bottleneck and make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.

This newer scheme allowed the actual simulations to bypass the heavy, sequential GIS computations, since their precomputed result could be read directly from a pre-established file. While the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even though it had largely been made parallel.
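A minimal sketch of such a two-pass loading scheme, in Python and with an invented toy definition format (Sim-Diasca's actual file format and class names are not shown here), illustrates how cyclic references between initial instances can be accommodated:

```python
# Toy instance definitions: two roads referring to each other (a cycle).
# In the real format these would come from the initialisation file.
defs = {
    "road_a": {"class": "Road", "next": "road_b"},
    "road_b": {"class": "Road", "next": "road_a"},  # cycle back to road_a
}

class Instance:
    def __init__(self, cls):
        self.cls = cls
        self.next = None  # reference patched in the second pass

# Pass 1: create every instance from its definition. This step is
# order-independent, hence amenable to parallel processing.
instances = {name: Instance(d["class"]) for name, d in defs.items()}

# Pass 2: resolve symbolic references into direct links; cycles are
# harmless because all targets already exist.
for name, d in defs.items():
    instances[name].next = instances[d["next"]]
```

The design choice is the usual one for cyclic object graphs: separate creation from linking, so that no instance needs its referents to exist before it is created.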

The last changes that were made dealt with the integration of third-party tools, like BenchErl and Percept2, with Sim-Diasca.

Distributed applications like Sim-Diasca have, of course, their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could instead take care of the deployment on its own; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

If an ad hoc solution for the BenchErl integration could finally be devised, not only did deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which were the elected nodes and notify it when each simulation phase began or finished (e.g. monitoring the

7GIS stands for Geographic Information System.
8The procedural generation notably had to ensure that any two interconnected points of interest respected minimal

distances; otherwise the shorter roads would lead to traffic durations brief to the point of inducing, when being quantised over the simulation time-step, a relative error above the default threshold allowed by the engine. Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading might not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings are discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang, we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables controlling locality and reducing connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model, nodes are grouped into a number of s_groups: nodes have transitive connections with nodes from the same s_group and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global name space; instead, every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large-scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found at https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster, located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy.

The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that the node names do not appear to correspond exactly to the physical structure of the cluster: see [REL15, §4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers (x1, ..., xn) ∈ [0, X]. The generator functions are applied


Figure 6: SLURM allocation

to the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.
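A sequential Python sketch of the underlying fixpoint computation (with illustrative generators and space bound, not those of the actual benchmark) may help fix ideas:

```python
# Repeatedly apply the generator functions to newly found numbers
# until no new number is generated (a transitive closure).
def orbit(generators, x0, space):
    seen = {x0}
    frontier = [x0]
    while frontier:
        x = frontier.pop()
        for g in generators:
            y = g(x) % space          # keep results inside [0, space)
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen

# Example: doubling and incrementing modulo 10, starting from 1.
result = orbit([lambda x: 2 * x, lambda x: x + 1], 1, 10)
```

In the distributed versions described below, the `seen` set is instead partitioned across worker processes as a distributed hash table.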

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.
$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.
$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here, N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and --qos=release is the RELEASE project quota that enables requesting up to 256 Athos hosts.

To run the experiments, we need to define the following parameters in the run-slurm script (Figure 7).


FROMNUMNODES is the minimum number of nodes on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts (i.e. one Erlang node per Athos host); we then run the experiment using the timetest script and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results were inconsistent: sometimes the first run took significantly longer than the rest of the experiments, and sometimes the time per experiment increased with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments where the same VMs are used for all runs.

The module, function, and parameters that are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the benchdist/4 function.

3.1.2 Distributed Erlang Orbit

In distributed Erlang Orbit, all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table; a hash function is applied to a generated number to find in which part of the hash table this number should be stored.
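The partitioning step can be sketched as follows (in Python, with an illustrative worker count and Python's built-in hash standing in for the benchmark's actual hash function):

```python
# Route a generated number to the worker process owning the matching
# slice of the distributed hash table.
def owner(x, num_workers):
    return hash(x) % num_workers   # index of the worker storing x

# One table fragment per worker process (8 workers here, an assumption).
table = [set() for _ in range(8)]

def store(x):
    table[owner(x, len(table))].add(x)
```

Because the owner of a number is computed purely from its value, any worker that generates a number knows directly which peer to send it to, with no central lookup.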

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially, the master process has a specific amount of credit. Each active process holds a portion of the credit and, when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds back. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
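The idea of the credit scheme can be sketched as follows (a simplified, single-threaded Python model of the accounting, not the actual distributed implementation; the fan-out parameter is illustrative):

```python
from fractions import Fraction

# Work units carry credit; an active process splits its credit with the
# work it spawns, and a passive process returns its credit. The master
# declares termination once the returned credit sums back to the total.
def run(tasks_spawned_per_task):
    total = Fraction(1)
    returned = Fraction(0)
    # each queue entry is (credit held, remaining fan-out)
    queue = [(total, tasks_spawned_per_task)]
    while queue:
        credit, fanout = queue.pop()
        if fanout == 0:
            returned += credit                    # passive: send credit back
        else:
            queue.append((credit / 2, fanout - 1))  # keep half, stay active
            queue.append((credit / 2, 0))           # spawned unit gets half
    return returned == total                      # all credit recovered?
```

Exact fractions are used so the credit always sums back to exactly 1; a real implementation would use an equivalent exact representation rather than floating point.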

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code.

Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

Parameters. In the experiments discussed in Section 3.1.4, we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2×10^6, 3×10^6, 4×10^6, and 5×10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit, we group nodes into a set of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N - 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M - 1) TCP connections and a sub-master node has (M - 1 + (N - 1)/M) connections.
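The connection counts just described can be expressed directly (a small Python sketch of the per-node formulas; the example figures in the comment are illustrative, not measured values):

```python
# Per-node TCP connection counts: D-Orbit's fully connected mesh versus
# SD-Orbit's s_group structure (N nodes in total, M per worker s_group).
def d_orbit_worker_connections(n):
    return n - 1                       # connected to every other node

def sd_orbit_worker_connections(m):
    return m - 1                       # only within its own s_group

def sd_orbit_submaster_connections(n, m):
    # its s_group's workers, plus roughly one connection per other s_group
    return (m - 1) + (n - 1) / m

# e.g. with 257 nodes and 16-node worker s_groups, a D-Orbit worker keeps
# 256 connections while an SD-Orbit worker keeps only 15.
```

This is the quantitative core of the SD Erlang argument: worker connectivity becomes independent of cluster size, at the price of routing inter-group traffic through sub-masters.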

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that in this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No | Name     | Location | Hosts | Cores per host | Total cores | Max cores | Wait time | Processor                | RAM  | Distributed Erlang Port
---|----------|----------|-------|----------------|-------------|-----------|-----------|--------------------------|------|------------------------
1  | GPG      | GLA      | 20    | 16             | 320         | 320       | 0         | Xeon E5-2640 v2, 2GHz    | -    | Yes
2  | TinTin   | Uppsala  | 160   | 16             | 2560        | -         | -         | -                        | -    | Yes
3  | Kalkyl   | Uppsala  | 8     | -              | -           | -         | varies    | -                        | -    | Yes
4  | Athos    | EDF      | 776   | 24             | 18624       | 6144      | varies    | Xeon E5-2697 v2, 2.7GHz  | 64GB | Yes
5  | Zumbrota | EDF      | 4096  | 16             | 65536       | -         | 17hrs     | Blue Gene/Q (PowerPC A2) | -    | No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that, after reaching a peak, the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human intervention is then required to restart the hosts. The way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put approximately 157 Athos hosts out of action, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, an optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4: Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while the other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
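The generation loop just described can be sketched as follows (a simplified, sequential Python model, not the concurrent Erlang SMP-ACO; the unit-length cost function, evaporation rate, and update rule are illustrative assumptions):

```python
import random

# One ant builds a schedule position by position, picking each job with
# probability proportional to the pheromone entry P[job][position].
def ant_build(P, n, rng):
    jobs, schedule = list(range(n)), []
    for pos in range(n):
        weights = [P[j][pos] for j in jobs]
        job = rng.choices(jobs, weights=weights)[0]
        jobs.remove(job)
        schedule.append(job)
    return schedule

# Toy weighted-tardiness cost with unit-length jobs.
def cost(schedule, due, weight):
    t, total = 0, 0
    for j in schedule:
        t += 1
        total += weight[j] * max(0, t - due[j])
    return total

def colony(n, due, weight, ants=10, generations=20, rho=0.1, seed=1):
    rng = random.Random(seed)
    P = [[1.0] * n for _ in range(n)]          # uniform initial pheromone
    best, best_cost = None, float("inf")
    for _ in range(generations):
        sols = [ant_build(P, n, rng) for _ in range(ants)]
        gen_best = min(sols, key=lambda s: cost(s, due, weight))
        if cost(gen_best, due, weight) < best_cost:
            best, best_cost = gen_best, cost(gen_best, due, weight)
        for i in range(n):                      # evaporate all entries...
            for j in range(n):
                P[i][j] *= (1 - rho)
        for pos, job in enumerate(best):        # ...then reinforce the best
            P[job][pos] += 1.0
    return best, best_cost
```

In the Erlang version the `ant_build` calls run as concurrent ant processes reading a shared ETS table, with only the master performing the evaporate-and-reinforce update.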

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing, because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and could thus become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying:

ICT-287510 (RELEASE) 23rd December 2015 26

[Diagram: a master process at Level 0; sub-master nodes at Levels 1 to N−1; Level N contains only colony nodes. Legend distinguishes a process, a node, and a group of nodes.]

Figure 17: Node Placement in Multi-Level Distributed ACO


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + … + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150) and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO): This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO): This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
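The level-counting arithmetic for the sub-master tree can be checked mechanically. The following Python sketch is illustrative only (the actual implementations are in Erlang, and `tree_size`/`max_levels` are hypothetical helper names): it computes the total node count of an X-level tree with P processes per sub-master node, and the deepest tree that fits on N available nodes.

```python
def tree_size(p, levels):
    """Total nodes in a tree with `levels` levels: one master node,
    sub-master levels with p, p**2, ..., p**(levels-2) nodes, and
    p**levels colony nodes on the last level."""
    return 1 + sum(p**k for k in range(1, levels - 1)) + p**levels

def max_levels(p, n):
    """Largest number of levels X such that the whole tree fits on n nodes."""
    x = 1
    while tree_size(p, x + 1) <= n:
        x += 1
    return x

# The worked example from the text: P = 5, N = 150 gives a 3-level tree
# using tree_size(5, 3) = 1 + 5 + 125 = 131 of the 150 nodes.
```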

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.
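The two alternating phases can be sketched as follows. This is a minimal Python illustration, not the project's Erlang code: `run_colony` is a hypothetical stand-in for a whole colony of ant processes and simply draws a random candidate cost, while the master's role is the selection and broadcast of the global best.

```python
import random

def run_colony(best_so_far, seed):
    """Stand-in for one colony's construction phase: returns its best
    (cost, solution) pair. The real colony runs NA ant processes."""
    rng = random.Random(seed)  # seeded, so the sketch is deterministic
    candidate = (rng.uniform(0, 100), f"solution-{seed}")
    return min(best_so_far, candidate)

def master_loop(num_colonies, global_iterations):
    """Coordination phase: collect each colony's best, select the
    overall best, and broadcast it back for pheromone updates."""
    bests = [(float("inf"), None)] * num_colonies
    for it in range(global_iterations):
        bests = [run_colony(bests[c], seed=it * num_colonies + c)
                 for c in range(num_colonies)]
        global_best = min(bests)              # master selects overall best
        bests = [global_best] * num_colonies  # broadcast back to colonies
    return global_best
```

The construction phase costs roughly the same regardless of the number of colonies, so any change in total runtime comes from the collect/broadcast step, which is what the execution-time metric measures.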

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Plot: mean error (%) against number of colonies, 1–256.]

Figure 19: Mean Error

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, then run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken to find a solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
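The mean error plotted in Figure 19 is simply the average relative deviation of the obtained costs from the known optima. A small illustrative helper (the function name is hypothetical, and it assumes non-zero optimal costs):

```python
def mean_error_percent(obtained, optimal):
    """Mean relative cost difference (%) between obtained solutions and
    the known optimal solutions, averaged over benchmark instances."""
    assert len(obtained) == len(optimal) and len(optimal) > 0
    errors = [(got - opt) / opt * 100.0
              for got, opt in zip(obtained, optimal)]
    return sum(errors) / len(errors)

# e.g. two instances, 10% and 5% above the optimum: mean error is 7.5%
```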

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


[Plot: mean execution time (s) against number of colonies, 1–256.]

Figure 20: Execution time

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
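The random-generator substitution can be pictured as follows. This is a sketch in Python for brevity (the actual substitution was made in the Erlang ACO code): a deterministic generator returning a fixed cyclic sequence, so that repeated runs with the same inputs take identical decisions.

```python
from itertools import cycle

class CyclicRandom:
    """Deterministic stand-in for a uniform RNG: yields a fixed cyclic
    sequence of values in [0, 1), making runs exactly reproducible."""
    def __init__(self, values=(0.05, 0.25, 0.45, 0.65, 0.85)):
        self._next = cycle(values).__next__

    def uniform(self):
        return self._next()
```

Any probabilistic choice in the algorithm (e.g. an ant's next-job selection) then consumes values from this cycle instead of a true RNG.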

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
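Erlang's timer:tc returns a {Time, Value} pair, with the time in microseconds. For readers unfamiliar with it, an equivalent measurement could be written as follows (Python used purely for illustration):

```python
import time

def tc(fun, *args):
    """Analogue of Erlang's timer:tc/2: call fun(*args) and return
    (elapsed_microseconds, result)."""
    start = time.perf_counter()
    result = fun(*args)
    elapsed_us = int((time.perf_counter() - start) * 1_000_000)
    return elapsed_us, result
```

Wrapping the top-level simulation call this way excludes any setup done before the call, which is why the reported times omit the argument-processing overhead.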

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems; TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, and GR-ACO.]

Figure 21: R15B execution times, Athos cluster

[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, and GR-ACO.]

Figure 22: OTP 17.4 execution times, Athos cluster


[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, GR-ACO, and SR-ACO.]

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version; as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


[Plot: execution time (s) against number of nodes, 0–250, for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 24: TL-ACO execution times, Athos cluster

[Plot: execution time (s) against number of nodes, 0–250, for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 25: ML-ACO execution times, Athos cluster


[Plot: execution time (s) against number of nodes, 0–250, for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 26: GR-ACO execution times, Athos cluster

[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, and GR-ACO.]

Figure 27: R15B execution times, messages ×500


[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, and GR-ACO.]

Figure 28: OTP 17.4 execution times, messages ×500

[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, GR-ACO, and SR-ACO.]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500


[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, and GR-ACO.]

Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,
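The fragmentation is easier to see once a SLURM hostlist expression is expanded into individual host names. Below is a minimal, illustrative parser for the bracketed syntax used above; it handles only a single `prefix[...]` group with comma-separated entries and ranges, unlike SLURM's own `scontrol show hostnames`.

```python
import re

def expand_hostlist(expr):
    """Expand a simple SLURM hostlist such as 'atcn[141,144,181-184]'
    into individual host names (single bracket group only)."""
    m = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", expr)
    if not m:
        return [expr]  # no bracket group: already a plain host name
    prefix, body = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding, e.g. 055 -> atcn055
            hosts.extend(f"{prefix}{i:0{width}d}"
                         for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + part)
    return hosts
```

Counting the expanded hosts in the two allocations quoted above makes the difference in contiguity, and hence in likely inter-node "distance", immediately visible.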


[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, and GR-ACO.]

Figure 31: OTP 17.4 execution times (2), Athos cluster

[Plot: execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, GR-ACO, and SR-ACO.]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing the Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this rises to only 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
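The relative speedups quoted here are simply the single-node runtime divided by the n-node runtime. A sketch of the calculation, using runtimes roughly read off Figure 34 (illustrative values, not the exact measurements):

```python
def relative_speedup(runtimes):
    """Speedup of each configuration relative to the smallest (1-node)
    configuration. `runtimes` maps node count -> mean runtime (minutes)."""
    base = runtimes[min(runtimes)]
    return {nodes: base / t for nodes, t in sorted(runtimes.items())}

# Approximate figures: ~1000 min on 1 node, under 300 min on 16 nodes.
speedups = relative_speedup({1: 1000, 2: 667, 4: 455, 16: 290})
```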

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at this scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.
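To illustrate the node name base template idea (the function and template handling below are hypothetical sketches, not WombatOAM's actual API), node names of the usual Erlang base@host form might be generated like this:

```python
def node_names(base_template, count, host):
    """Hypothetical illustration: a node name base is generated per node
    from the template, and the full node name is base@host."""
    return [f"{base_template}{i}@{host}" for i in range(1, count + 1)]

# e.g. deploying two nodes with base template "sim" on host "gpg01"
# yields names like sim1@gpg01 and sim2@gpg01 (illustrative only).
```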

The modifications in both tools follow their general license; i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct version of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes, and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involves terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency: Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing: Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
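For illustration only, the two options might be supplied as configuration entries along these lines. The option names start_nodes and use_cookies come from the text above, but the surrounding syntax and the example values are our assumptions, not the actual Sim-Diasca configuration format:

```erlang
%% Hypothetical sketch of a configuration fragment; the real format may differ.
[
  %% Assume the computing nodes passed to the deployment manager are
  %% already running, so skip their deployment entirely.
  {start_nodes, already_running},

  %% Connect to the computing nodes using this fixed, shared cookie
  %% instead of a randomly generated one.
  {use_cookies, wombat_shared_cookie}
].
```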

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or let the instances acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
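The naming rule implied by this example could be sketched as follows. This is a reconstruction inferred from the single example above, with our own module and function names; the actual Sim-Diasca code may differ:

```erlang
%% Hypothetical sketch: build the expected computing node name from the
%% simulation name, the user name and the host.
-module(node_naming_sketch).
-export([node_name/3]).

node_name(Simulation, User, Host) ->
    %% "soda_benchmarking_test" -> "Soda_Benchmarking_Test"
    Camel = string:join([capitalize(W)
                         || W <- string:tokens(Simulation, "_")], "_"),
    "Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host.

capitalize([C | Rest]) -> string:to_upper([C]) ++ Rest.

%% node_name("soda_benchmarking_test", "myuser", "10.0.0.1") would yield
%% "Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1".
```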

To call the wo_orch_simdiasca:start_computing_nodes/2 function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
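A minimal sketch of how such a grouping could be expressed with the SD Erlang s_group API (group and node names are invented for illustration; as stated above, this design has not been implemented):

```erlang
%% Hypothetical sketch, in an SD Erlang shell. Create an s_group containing
%% the root time manager's node and its children's nodes:
> s_group:new_s_group(tm_root, ['root_tm@h0', 'tm1@h1', 'tm2@h2']).

%% tm1 then also joins the s_group of its own children, so its node belongs
%% to exactly two s_groups: its parent's and its children's.
> s_group:new_s_group(tm1_children, ['tm1@h1', 'tm3@h3', 'tm4@h4']).
```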


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving the knowledge about these applications and the scalability issues they experience, interpretations were made, preparing the removal of the next bottlenecks to be encountered and promoting some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (these are the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users. Only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
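Putting these functions together, a typical session on each node of an MPI job might look as follows (a sketch based on the function names above; the comments describe the behaviour stated in the text, and the session itself is illustrative):

```erlang
%% Hypothetical sketch: run on every Erlang node of the MPI job,
%% with the same (default) base name on each.
> mpihelper:startup().        % names this node from "mpinode", its MPI index
                              % and its hostname, then initializes the
                              % connections between every pair of nodes
> mpihelper:get_world_size(). % total number of nodes in the MPI job
> mpihelper:get_index().      % this node's unique MPI index number
> mpihelper:nodes().          % the set of all other nodes
```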

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000 ants

• Large: 1, 500, 1000, 1500, ..., 100000 ants

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
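Written as Erlang terms (a sketch with our own variable names), the two series are:

```erlang
%% In an Erlang shell:
> Small = [1 | lists:seq(10, 1000, 10)].     % 1, 10, 20, ..., 1000 (101 runs)
> Large = [1 | lists:seq(500, 100000, 500)]. % 1, 500, 1000, ..., 100000 (201 runs)
```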

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP-17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to restore the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE)]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE)]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version  Date        Comments

0.1      31/01/2015  First version, submitted to internal reviewers

0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 5: Expected scalability profile

Indeed, thanks to the use of its embedded numerical solver, a weather cell model requires significantly more processing power than most waste-related models, and this load is rather homogeneous in (simulated) time and space. Moreover, as each cell has a small memory footprint (needing just to store its current physical state and references to the adjacent cells) and induces few, predictable interactions (up to four actor messages being sent during its spontaneous behaviour, and as many being received during its triggered one), it is a perfect fit to control a processing demand independently from the other requested resources.

As a result of this weather addition, we obtained a complete simulation case, mixing two modelling paradigms overall (algorithmic and equation-driven, respectively for the waste and for the weather domains), able to adopt almost any scale in terms of time (duration of a simulation) and space (size of the city, hence scale of the problem), and which can moreover be finely and easily tuned in terms of respective resource consumption.

If we had to estimate the resulting scalability before even running the corresponding experiments, the rough profile shown in Figure 5 would be expected.

Typically, in these distributed large-scale simulations, for a given scale, if the number of computing hosts is below a first threshold, the simulation will not be able to run at all, as the total memory footprint of the simulation will exceed the available (distributed) one.

Then, as soon as the strict minimum amount of resources is reached, the simulations will be able to run. They will most probably start by being CPU-bound, as on average there should be many more model instances to schedule at a given diasca than there are available cores on a single computing host; simulations will then become increasingly faster as the number of hosts (hence cores) increases.

Adding still more hosts will remove this second resource barrier (the first being memory, the second being processing), but will progressively scatter the interacting instances across more and more hosts6 - thus increasingly replacing local communications with networked ones, and slowing down the

6Even with a smart load balancer, the degradation is likely to be very significant, as by default, when using N computing hosts, the probability that an interaction can remain local is 1/N.


overall simulation. As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking actions, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city was long, and that this duration increased very quickly as the scale grew - notably because of the embedded mini-GIS7, which was operating sequentially and whose load was growing exponentially with the number of spatialised instances to manage8.

Efforts were made to remove that GIS bottleneck and to make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for a largely parallel processing thereof.
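The deliverable does not specify the file format or the loader internals; as an illustration only, a two-pass loading scheme that copes with cyclic references can be sketched in Python (the JSON-line format and all field names below are hypothetical):

```python
import json

def load_instances(lines):
    """Two-pass loader sketch: create all instances first, then
    resolve (possibly cyclic) references between them."""
    specs = [json.loads(line) for line in lines]
    # Pass 1: instantiate every entry; references stay symbolic.
    # (This pass can be parallelised over chunks of the file.)
    instances = {}
    for spec in specs:
        instances[spec["id"]] = {"class": spec["class"],
                                 "state": dict(spec["state"]),
                                 "refs": {}}
    # Pass 2: every target now exists, so cycles are harmless.
    for spec in specs:
        for slot, target_id in spec.get("refs", {}).items():
            instances[spec["id"]]["refs"][slot] = instances[target_id]
    return instances

# Two road junctions referencing each other (a cycle):
lines = [
    '{"id": "poi_a", "class": "RoadJunction", "state": {"x": 0}, "refs": {"next": "poi_b"}}',
    '{"id": "poi_b", "class": "RoadJunction", "state": {"x": 1}, "refs": {"next": "poi_a"}}',
]
city = load_instances(lines)
assert city["poi_a"]["refs"]["next"] is city["poi_b"]
```

Separating creation from reference resolution is what allows the first pass to be processed largely in parallel, as the text describes.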

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be directly loaded from a pre-established file. While the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even if it was largely made parallel.

The last changes that were made dealt with the integration of third-party tools with Sim-Diasca, like BenchErl and Percept2.

Distributed applications like Sim-Diasca of course have their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be controlling that.

Changes were made in the engine so that BenchErl could take care of the deployment on its own instead; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

While an ad hoc solution for the BenchErl integration could finally be devised, not only did deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which were the elected nodes, and notify it when each simulation phase began or finished (e.g. monitoring the

7GIS stands for Geographic Information System.
8The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances, otherwise the shorter roads would lead to traffic durations that would be brief to the point of inducing, when being quantised over the simulation time-step, a relative error above the default threshold allowed by the engine; Sim-Diasca would then detect this violation at runtime and stop the simulation on error.


initial loading might not be of interest for benchmarking purposes), and so that the latter could request setting updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.
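As a sketch of what such an uncoupling can look like (illustrative Python with hypothetical names, not the actual Sim-Diasca plugin API), the engine registers plugins and notifies them at each simulation phase, and a plugin may answer with requested setting updates:

```python
class EnginePluginManager:
    """Minimal phase-notification plugin system (hypothetical names)."""
    def __init__(self):
        self.plugins = []
        self.settings = {"schedulers": 1}

    def register(self, plugin):
        self.plugins.append(plugin)

    def notify(self, phase, info=None):
        # Each plugin may return a dict of requested setting updates.
        for plugin in self.plugins:
            updates = plugin.on_phase(phase, info) or {}
            self.settings.update(updates)

class ProfilerPlugin:
    """Stand-in for a tool such as Percept2."""
    def __init__(self):
        self.seen = []

    def on_phase(self, phase, info):
        self.seen.append(phase)
        if phase == "deployment_done":
            # e.g. request more schedulers on the computing nodes
            return {"schedulers": 24}

mgr = EnginePluginManager()
prof = ProfilerPlugin()
mgr.register(prof)
mgr.notify("deployment_done", info={"nodes": ["node_a", "node_b"]})
mgr.notify("simulation_started")
assert mgr.settings["schedulers"] == 24
```

This captures the two-way exchange described above: the engine pushes phase notifications (including the elected nodes) to the tool, and the tool pulls setting changes from the engine.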

On that technical basis, measurements were performed; results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables control of locality and reduces connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model nodes are grouped into a number of s_groups; nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global namespace, but every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found here: https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster, located at EDF, France. Athos has 776 compute nodes, called atcn001-atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below), which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that it appears that the node names do not correspond exactly to the physical structure of the cluster: see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0,X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0,X]. This creates new numbers (x1, ..., xn) ∈ [0,X]. The generator functions are applied


Figure 6: SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines, and has good performance and extensibility.
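The fixpoint computation itself is compact; a sequential sketch in Python (illustrative only, not the benchmark's Erlang code, with hypothetical generator functions) is:

```python
def orbit(generators, x0, space):
    """Apply each generator to every known vertex until no new
    vertex appears; the resulting set is the orbit of x0."""
    seen = {x0}
    frontier = [x0]
    while frontier:
        x = frontier.pop()
        for g in generators:
            y = g(x) % space  # keep results inside [0, space)
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen

# Hypothetical generators on a small space [0, 10):
gens = [lambda x: x + 7, lambda x: 3 * x]
result = orbit(gens, x0=1, space=10)
```

In the distributed versions, the `seen` set becomes the distributed hash table: each newly generated vertex is hashed to decide which worker owns it, and the credit/recovery algorithm from the feature list detects when all frontiers are empty.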

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here, N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and qos=release is the RELEASE project quota that enables requesting up to 256 Athos hosts.

To run the experiments we need to define parameters in the run-slurm script (Figure 7)

ICT-287510 (RELEASE) 23rd December 2015 15

FROMNUMNODES is the minimum number of nodes on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; then we run the experiment using the timetest script and stop the VMs. We also tried to run the experiments without stopping the VMs for every run, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop VMs for every run, even though it takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the benchdist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find in which part of the hash table this number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process collects the credit, it can detect whether the computation has finished.
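The credit idea can be illustrated with a minimal sketch (Python, illustrative only, not the benchmark's implementation): the master starts with credit 1, each spawned task receives a share, and passive tasks return what they hold; termination is declared once the returned credit sums back to 1. Exact fractions are used to avoid floating-point loss:

```python
from fractions import Fraction

class Master:
    """Collects returned credit; termination when it sums to 1."""
    def __init__(self):
        self.returned = Fraction(0)

    def recover(self, credit):
        self.returned += credit

    def terminated(self):
        return self.returned == 1

def spawn_work(credit, depth):
    """Each task splits its credit with the work it spawns; on
    becoming passive it returns everything it still holds."""
    if depth == 0:
        return credit                       # leaf: passive at once
    half = credit / 2
    returned = spawn_work(half, depth - 1)  # child task gets half
    return returned + half                  # parent passive afterwards

master = Master()
master.recover(spawn_work(Fraction(1), depth=5))
assert master.terminated()
```

Real implementations use binary weights (2^-k) for exactly this reason: credit must be conserved precisely, or termination would be declared too early or never.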

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code

Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

Parameters. In the experiments we discuss in Section 3.1.4, we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2*10^6, 3*10^6, 4*10^6, and 5*10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, and repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections, and a sub-master node has (M − 1 + (N − 1)/M) connections.
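As a worked illustration of these formulas (hypothetical sizes, not the experimental configuration):

```python
def connections(n_nodes, group_size):
    """Per-node TCP connection counts, following the formulas above:
    full mesh: N-1; SD-Orbit worker: M-1; sub-master: M-1+(N-1)/M."""
    full_mesh_worker = n_nodes - 1
    sd_worker = group_size - 1
    sd_submaster = group_size - 1 + (n_nodes - 1) / group_size
    return full_mesh_worker, sd_worker, sd_submaster

# Hypothetical: N = 256 nodes, s_groups of M = 11 nodes each
# (1 sub-master + 10 workers, as in the configuration of Sec. 3.1.3)
mesh, worker, submaster = connections(256, 11)
# mesh = 255, worker = 10, submaster ≈ 33.2
```

With these numbers a worker drops from 255 connections in the full mesh to 10, at the cost of routing inter-group traffic through the sub-masters.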

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of the worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each sub-master s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No  Name      Location  Hosts  Cores/host  Total cores  Max cores  Wait time  Processor                 RAM   Distributed Erlang Port
1   GPG       GLA       20     16          320          320        0          Xeon E5-2640 v2, 2GHz           Yes
2   TinTin    Uppsala   160    16          2560         -                                                     Yes
3   Kalkyl    Uppsala          8                                   varies                                     Yes
4   Athos     EDF       776    24          18624        6144       varies     Xeon E5-2697 v2, 2.7GHz   64GB  Yes
5   Zumbrota  EDF       4096   16          65536                   17hrs      BlueGene/Q (PowerPC A2)         No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading. However, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and the performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that, after reaching a peak, the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and, unlike D-Orbit, its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human involvement is then required to restart the hosts. The way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source, and can be found here: https://github.com/release-project/benchmarks/tree/master/ACO

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one. Once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
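One generation of this scheme can be sketched as follows (illustrative Python with a toy cost function and a simplified update rule; the project's implementation is in Erlang and follows [BBHS99, dBSD00, MM00]):

```python
import random

def ant_solution(pheromone, n_jobs, rng):
    """Build a schedule by pheromone-weighted random choice:
    P[job][pos] is the desirability of job at position pos."""
    remaining = list(range(n_jobs))
    schedule = []
    for pos in range(n_jobs):
        weights = [pheromone[job][pos] for job in remaining]
        job = rng.choices(remaining, weights=weights)[0]
        remaining.remove(job)
        schedule.append(job)
    return schedule

def generation(pheromone, cost, n_ants, rng, rho=0.1):
    """One generation: all entries evaporate, then the best ant's
    (job, position) entries are reinforced. Schematic rule only."""
    n = len(pheromone)
    solutions = [ant_solution(pheromone, n, rng) for _ in range(n_ants)]
    best = min(solutions, key=cost)
    for job in range(n):
        for pos in range(n):
            pheromone[job][pos] *= (1 - rho)   # evaporation
    for pos, job in enumerate(best):
        pheromone[job][pos] += rho             # reinforcement
    return best

rng = random.Random(1)
n = 5
pheromone = [[1.0] * n for _ in range(n)]
cost = lambda s: sum(pos * job for pos, job in enumerate(s))  # toy cost
for _ in range(20):
    best = generation(pheromone, cost, n_ants=8, rng=rng)
```

In the Erlang version the inner list comprehension over ants becomes one process per ant, all reading the shared ETS-based pheromone matrix, with only the master performing the evaporation and reinforcement writes.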

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report

ICT-287510 (RELEASE) 23rd December 2015 25

Figure 16: Two-Level Distributed ACO (the master process, colony processes on nodes 1 to NC, and ant processes 1 to NA on each colony node)

their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows.

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step, each colony process spawns NA ant processes on the local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and colonies are bidirectional. There are IM communications between the master process and a colony process; also, IA bidirectional communications are done between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO, the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L + 1. However, the last level is an exception, because it consists of only colony nodes, and every colony node has one colony process. A process on level N-1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X in the following


Figure 17: Node Placement in Multi-Level Distributed ACO (level 0: master process; levels 1 to N-1: sub-master nodes; level N: colony nodes only)


Figure 18: Process Placement in Multi-Level ACO


inequality: 1 + P + P^2 + ... + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150) and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO, nodes are only connected to the nodes in their own s_group.
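The sizing rule for the ML-ACO sub-master tree can be checked numerically. A small Python sketch (the function name is ours) that finds the number of levels and usable nodes for given P and N:

```python
def tree_levels(p, n):
    """Largest number of levels X such that the master, the
    sub-master levels and the P^X colony nodes fit into N nodes,
    i.e. 1 + P + ... + P^(X-2) + P^X <= N.  Returns (X, nodes_used),
    or None if not even one level fits."""
    x, best = 1, None
    while True:
        used = sum(p ** i for i in range(x - 1)) + p ** x
        if used > n:
            return best
        best = (x, used)
        x += 1
```

For P = 5 and N = 150 this reproduces the worked example above: 3 levels using 131 of the 150 nodes.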

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


Figure 19: Mean error (%) against the number of colonies (1 to 256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, and then to run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions. It is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
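The text does not spell out the exact error definition; assuming it is the percentage deviation from the known optimal cost, averaged over the benchmark instances (as the figure's axis suggests), the metric can be sketched as:

```python
def mean_error(costs, optima):
    """Mean percentage deviation of obtained schedule costs from
    the known optimal costs, one pair per benchmark instance."""
    errors = [100.0 * (cost - best) / best for cost, best in zip(costs, optima)]
    return sum(errors) / len(errors)
```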

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Execution time (mean execution time (s) against the number of colonies, 1 to 256)

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
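A cyclic stand-in for the random number generator, used to make repeated runs deterministic, can be sketched as follows (illustrative Python; the actual replacement was made in the Erlang code):

```python
import itertools

def cyclic_rng(values):
    """Deterministic stand-in for a random number generator: yields
    the given values in a fixed cycle, so repeated runs of the
    benchmark make exactly the same 'random' choices."""
    cycle = itertools.cycle(values)
    return lambda: next(cycle)
```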

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23, we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes. As explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO and GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO and GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO and SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version; as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE))

Figure 25: ML-ACO execution times, Athos cluster (R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE))


Figure 26: GR-ACO execution times, Athos cluster (R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE))

Figure 27: R15B execution times, messages x 500 (TL-ACO, ML-ACO and GR-ACO)


Figure 28: OTP 17.4 execution times, messages x 500 (TL-ACO, ML-ACO and GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500 (TL-ACO, ML-ACO, GR-ACO and SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (TL-ACO, ML-ACO and GR-ACO)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests, we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster; at certain points, adding a new machine would mean including one in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (TL-ACO, ML-ACO and GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO and SR-ACO)


and so would take longer to report its results to the master/sub-master nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
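The degree of fragmentation can be inspected by expanding the SLURM host-list syntax into individual hosts. A minimal Python sketch (handling only the simple single-bracket form shown in the allocations above, not the full SLURM grammar):

```python
def expand_nodelist(spec):
    """Expand a SLURM-style node list such as 'atcn[141,144,181-184]'
    into individual host names (simple single-bracket form only)."""
    prefix, _, body = spec.partition("[")
    body = body.rstrip("]")
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        width = len(lo)  # preserve zero-padding, e.g. '055'
        for k in range(int(lo), int(hi or lo) + 1):
            hosts.append("%s%0*d" % (prefix, width, k))
    return hosts
```

In practice `scontrol show hostnames` performs the same expansion; the sketch just makes the fragment counting explicit.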

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO ((a) number of sent packets; (b) number of received packets)


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), we achieve only 2.2 on 4 nodes (64 cores), and thereafter the speedup degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
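The relative speedup and parallel efficiency quoted above can be computed directly. A small Python sketch (the 290-minute 16-node runtime is an illustrative value consistent with the reported 3.45 speedup, not a measured figure):

```python
def relative_speedup(t_base, t_scaled):
    """Speedup of the scaled run relative to the base (1-node) run."""
    return t_base / t_scaled

def efficiency(t_base, t_scaled, node_factor):
    """Fraction of the ideal speedup actually achieved when using
    node_factor times as many nodes as the base run."""
    return relative_speedup(t_base, t_scaled) / node_factor
```

With roughly 1000 minutes on one node and below 300 minutes on 16 nodes, the efficiency falls to around a fifth of ideal, which is what the speedup curve in Figure 35 shows.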

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 of the 32 available logical cores) and 14% (8.96GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at this scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief' ((a) execution time; (b) speedup)


Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'. (a) Execution time; (b) Speedup.


• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their life-time running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

Two WombatOAM features implemented since then make it easier to deploy Sim-Diasca. The first such feature is node name base templates: by default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks, which allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their respective licenses: modifications to Sim-Diasca are under the LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes to Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks WombatOAM to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask WombatOAM to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca does not know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency: Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, that is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing: Our load testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries that make it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
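As an illustration, the relevant part of the deployment settings could then look like the following sketch; the option names follow the text above, but the exact value syntax and settings structure are assumptions, not Sim-Diasca's actual format:

```erlang
%% Hypothetical illustration; the real Sim-Diasca settings format may differ.
DeploymentSettings = [
    %% Assume the computing nodes passed as a parameter are already
    %% running (they were deployed by WombatOAM), so do not deploy them:
    {start_nodes, false},
    %% Do not generate random cookies; use the fixed cookie already
    %% shared by all WombatOAM-deployed computing nodes:
    {use_cookies, 'sim_diasca_shared_cookie'}
].
```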

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca-WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local (i.e. non-root) time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
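This grouping could be sketched with SD Erlang's s_group API as follows; the group names, the function, and the assumption that each time manager knows the nodes of its parent, siblings and children are all illustrative, not Sim-Diasca code:

```erlang
%% Sketch only: create the two s_groups that a non-root time manager's
%% node would belong to in the design above.
-module(tm_sgroups_sketch).
-export([join_groups/3]).

join_groups(ParentNode, SiblingNodes, ChildNodes) ->
    %% The s_group shared with the parent time manager and the siblings:
    {ok, _, _} = s_group:new_s_group(tm_parent_group,
                                     [ParentNode, node() | SiblingNodes]),
    %% The s_group containing this time manager's children; gateway
    %% processes registered in these groups would route messages
    %% between adjacent levels of the hierarchy:
    {ok, _, _} = s_group:new_s_group(tm_child_group,
                                     [node() | ChildNodes]),
    ok.
```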


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we were able to interpret their behaviour, prepare for the removal of the next bottlenecks to be encountered, and promote some design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of the Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a node name of the form basename ++ MPI_index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰ For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.



Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:



Figure 47: EDF Xeon machines, large executions


Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set



Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. (Execution time (s) against number of ants, up to 1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)

Figure 51: Glasgow Xeon machines, large executions. (Execution time (s) against number of ants, up to 100,000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. (Execution time (s) against number of ants, up to 1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)


Figure 54: Heriot-Watt AMD machine, large executions. (Execution time (s) against number of ants, up to 100,000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)

Change Log

Version  Date        Comments

0.1      31/01/2015  First version, submitted to internal reviewers

0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611-620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45-62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang (Erlang '13), pages 73-74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762-774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371-379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207-221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287-296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305-320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181-5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346-354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986-996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



overall simulation. As a result, one would expect three well-defined operating areas with regard to the number of hosts, and a single sweet spot, to exist.

2.2.3 Additional changes done for benchmarking

We went through various steps in order to ease the benchmarking work, by adapting and enhancing Sim-Diasca and/or the City-example case.

A first issue was that the procedural generation of the target city took a long time, and that this duration increased very quickly as the scale grew, notably because of the embedded mini-GIS7, which operated sequentially and whose load grew exponentially with the number of spatialised instances to manage8.

Efforts were made to remove that GIS bottleneck and to make these initialisations more parallel, but the generation of the initial state of the simulation remained quite long for larger simulations (e.g. more than two full days of generation before starting the evaluation of the simulation itself).

We thus uncoupled the generation of the initial state from the simulation. That way, for each scale of interest for the city (from tiny to huge), we could first generate, once and for all, a corresponding initialisation file, and then share it and run as many simulations as wanted from it.

This two-stage approach involved the definition of:

• a domain-agnostic, compact, expressive initialisation file format, in order to describe how initial model instances shall be created;

• a fairly powerful loading mechanism, able to cope with cyclic references and allowing for largely parallel processing thereof.

This newer scheme allowed the actual simulations to bypass the heavy sequential GIS computations, since their precomputed result could be read directly from a pre-established file. While the pre-simulation phases were indeed shortened, the creation of the initial instances itself remained a demanding operation, even if it was largely made parallel.

The last changes made dealt with the integration of third-party tools, like BenchErl and Percept2, with Sim-Diasca.

Distributed applications like Sim-Diasca of course have their own deployment services (often with application-specific logic for the selection of hosts, node creation, naming and setting, the creation and deployment of a case-specific archive with relevant code and data, etc.), while BenchErl expected to be in control of that.

Changes were made in the engine so that BenchErl could instead take care of the deployment on its own; then a simple script was written, allowing Sim-Diasca to be run directly from an Erlang shell (hence possibly having set up any context needed by BenchErl).

Although an ad hoc solution for the BenchErl integration could finally be devised, not only did deployment remain a general problem as soon as third-party tools (e.g. Percept2) had to be applied to the engine, but other strong needs had to be addressed as well: a two-way exchange may have to take place between the engine and the tool of interest, so that, for example, the former could tell the latter which nodes were elected, and notify it when each simulation phase began or finished (e.g. monitoring the

7 GIS stands for Geographic Information System.
8 The procedural generation notably had to ensure that any two interconnected points of interest respected minimal distances; otherwise the shorter roads would lead to traffic durations so brief that, once quantised over the simulation time-step, they would induce a relative error above the default threshold allowed by the engine. Sim-Diasca would then detect this violation at runtime and stop the simulation with an error.


initial loading might not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) from the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings are discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which makes it possible to control locality and reduce connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model nodes are grouped into a number of s_groups; nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration: there is no global name space, but every s_group has its own namespace, which is shared among the group members only.
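As an illustration, creating an s_group and registering a name in its namespace looks roughly as follows. This is a sketch only: it must run on an SD Erlang VM, the node names are hypothetical, and the API calls follow the description in [CLTG14].

```erlang
-module(sd_sketch).
-export([start/0]).

%% Sketch: group three (hypothetical) nodes into an s_group, then use
%% the s_group's own namespace instead of the global one.
start() ->
    Nodes = ['w1@atcn001', 'w2@atcn002', 'sub@atcn003'],
    %% Connections become transitive only within the new s_group.
    {ok, _, _} = s_group:new_s_group(orbit_group, Nodes),
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% Registered in the s_group's namespace, visible to members only
    %% (contrast with global:register_name/2).
    yes = s_group:register_name(orbit_group, coordinator, Pid),
    Pid = s_group:whereis_name(orbit_group, coordinator).
```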

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found at https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001-atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node, and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that the node names appear not to correspond exactly to the physical structure of the cluster: see [REL15, §4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers (x1, ..., xn) ∈ [0, X]. The generator functions are applied


Figure 6 SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here N is the number of Athos hosts, c is the number of cores per host, t is the requested time in minutes, and qos=release is the RELEASE project quota that enables us to request up to 256 Athos hosts.

To run the experiments, we need to define parameters in the run-slurm script (Figure 7):


FROMNUMNODES is the minimum number of nodes, on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment is run.

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7 and 10 nodes, and every experiment will run twice.
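The resulting sequence of node counts can be sketched in a few lines of Erlang (the module and function names here are ours, not those of the benchmark scripts):

```erlang
-module(run_params).
-export([node_counts/3]).

%% Node counts for the successive runs: from FROMNUMNODES up to the
%% total number of allocated hosts, in steps of STEPNODES.
node_counts(From, Step, Max) ->
    lists:seq(From, Max, Step).
```

For instance, node_counts(4, 3, 10) yields [4, 7, 10], matching the example above.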

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; then we run the experiment using the timetest script, and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos module when calling the benchdist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find in which part of the hash table the number should be stored.
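The hash-based placement can be sketched as follows (an illustration rather than the actual D-Orbit code; erlang:phash2/2 stands in for whatever hash function the benchmark uses):

```erlang
-module(dht_placement).
-export([owner/2]).

%% Map a generated number X onto the worker process owning the part of
%% the distributed hash table in which X must be stored.
owner(X, Workers) when Workers =/= [] ->
    Index = erlang:phash2(X, length(Workers)),  % in 0 .. length-1
    lists:nth(Index + 1, Workers).
```

Because the hash is deterministic, every worker computes the same owner for a given number, so no coordination is needed to decide where to store it.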

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds onwards. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
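A much-simplified sketch of the credit accounting (the full algorithm in [MC98] also deals with credit recovery and credit carried in messages; the module and function names are ours):

```erlang
-module(credit).
-export([split/1, check/1]).

%% The master starts with credit 1.0. An active process gives half of
%% its credit to each process it activates, and returns its remaining
%% credit when it becomes passive. When the credit returned to the
%% master sums to 1.0 again, the computation has terminated.
split(Credit) ->
    Half = Credit / 2,
    {Half, Half}.

check(ReturnedCredits) ->
    case lists:sum(ReturnedCredits) of
        1.0 -> finished;
        _   -> running
    end.
```

Using halves keeps the shares exactly representable as binary floats, so the sum is exact; real implementations often use integer or fractional encodings for the same reason.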

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/


Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code.

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2×10^6, 3×10^6, 4×10^6 and 5×10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size of 2M elements, varying the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into a number of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.
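These counts can be written down directly (a sketch; (N − 1)/M, which approximates the number of sub-master nodes, is computed with integer division here):

```erlang
-module(conn_counts).
-export([d_orbit_worker/1, sd_orbit_worker/1, sd_orbit_submaster/2]).

%% TCP connections per node, for a cluster of N nodes in total,
%% with worker s_groups of M nodes each.
d_orbit_worker(N)        -> N - 1.                        % fully connected
sd_orbit_worker(M)       -> M - 1.                        % own s_group only
sd_orbit_submaster(N, M) -> (M - 1) + ((N - 1) div M).    % + other sub-masters
```

For example, with N = 101 nodes in s_groups of M = 10, a D-Orbit worker keeps 100 connections, while an SD-Orbit worker keeps 9 and a sub-master keeps 19.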

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, for collecting credit and data, and for forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. In addition to the parameters defined in Section 3.1.2, for SD-Orbit we define the following:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To determine the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results showed that in this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on the


No  Name      Location  Hosts  Cores/host  Total cores  Max cores  Wait time  Processor                 RAM   Distributed Erlang port
1   GPG       GLA       20     16          320          320        0          Xeon E5-2640 v2, 2GHz     -     Yes
2   TinTin    Uppsala   160    16          2560         -          -          -                         -     Yes
3   Kalkyl    Uppsala   -      8           -            -          varies     -                         -     Yes
4   Athos     EDF       776    24          18624        6144       varies     Xeon E5-2697 v2, 2.7GHz   64GB  Yes
5   Zumbrota  EDF       4096   16          65536        -          17hrs      BlueGene/Q (PowerPC A2)   -     No

Table 1: Machines Available for Benchmarking in the RELEASE Project

corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of the Orbit, which ranges from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of the Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, after which human intervention is required to restart them. Because of the way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This holds for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on two further clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with the results we observe on the Athos cluster.


Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4: (a) Runtime; (b) Speedup


Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4: (a) Runtime; (b) Speedup


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4: (a) Runtime; (b) Speedup


Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4: (a) Runtime; (b) Speedup


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster: (a) Runtime; (b) Speedup


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4, Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source, and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started, the modifications to P serving to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.
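A typical pheromone update of this kind, where ρ ∈ (0, 1) is an evaporation rate and s_best is the best solution of the generation, can be written as follows (this is the standard rule from the ACO literature; the text above does not specify the exact formula used in our implementation):

```latex
P_{ij} \;\leftarrow\; (1-\rho)\,P_{ij} + \rho\,\Delta_{ij},
\qquad
\Delta_{ij} =
\begin{cases}
  1/\mathrm{cost}(s_{\mathrm{best}}) & \text{if } s_{\mathrm{best}} \text{ schedules job } i \text{ in position } j,\\[2pt]
  0 & \text{otherwise.}
\end{cases}
```

Entries on the best solution's path are thus reinforced in proportion to its quality, while all other entries decay.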

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one. Once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
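The master/ants structure just described can be sketched as follows. This is our own illustration, not the benchmark's code: the module name is ours, and `construct/1` and `update_pheromone/2` are trivial placeholders standing in for the real heuristic construction and pheromone update.

```erlang
%% Illustrative sketch of the single-colony structure described above.
-module(aco_colony).
-export([run/3]).

%% Run Generations generations of NumAnts ants for an instance of size N;
%% returns the best {Cost, Schedule} pair found.
run(NumAnts, Generations, N) ->
    P = ets:new(pheromone, [set, public]),   % one matrix row per ETS entry
    [ets:insert(P, {I, erlang:make_tuple(N, 1.0)}) || I <- lists:seq(1, N)],
    generation(NumAnts, Generations, P, undefined).

generation(_NumAnts, 0, _P, Best) -> Best;
generation(NumAnts, G, P, Best0) ->
    Master = self(),
    %% each ant is a process that constructs one solution and reports back
    [spawn_link(fun() -> Master ! {solution, construct(P)} end)
     || _ <- lists:seq(1, NumAnts)],
    Solutions = [receive {solution, S} -> S end || _ <- lists:seq(1, NumAnts)],
    Best = best_of([S || S <- [Best0 | Solutions], S =/= undefined]),
    update_pheromone(P, Best),               % only the master writes to P
    generation(NumAnts, G - 1, P, Best).

construct(_P) -> {rand:uniform(1000), stub_schedule}.  % placeholder heuristic
best_of(Solutions) -> lists:min(Solutions).            % lowest cost wins
update_pheromone(_P, _Best) -> ok.                     % placeholder update
```

The key design point carried over from the text is the single-writer discipline: ants only read the ETS table, so no locking of the pheromone matrix is needed.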

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO (a master process, N_C colony nodes, each with N_A ant processes)

their best solutions; the globally-best solution is then selected and reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placement of TL-ACO in a cluster with N_C nodes. The master process spawns N_C colony processes on the available nodes; in the next step, each colony process spawns N_A ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communication between the master process and the colonies is bidirectional: there are I_M communications between the master process and a colony process, and I_A bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying

Figure 17: Node Placement in Multi-Level Distributed ACO (the master process at level 0, sub-master nodes at the intermediate levels, and only colony nodes at the last level)

Figure 18: Process Placement in Multi-Level ACO

1 + P + P^2 + P^3 + ... + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 = 131 ≤ 150) and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
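The sub-master tree sizing used by the multi-level versions can be sketched as a small calculation. This is our own illustration of the budget formula above (module and function names are ours): given P processes per sub-master node and N available nodes, it finds the largest X with 1 + P + ... + P^(X-2) + P^X ≤ N and reports how many nodes are actually used.

```erlang
%% Sketch of the node-budget calculation above; returns {Levels, NodesUsed}.
-module(ml_tree).
-export([plan/2]).

plan(P, N) when N >= 1 + P * P ->   % at least a 2-level tree must fit
    plan(P, N, 2).

plan(P, N, X) ->
    case nodes_used(P, X + 1) =< N of
        true  -> plan(P, N, X + 1);          % a deeper tree still fits
        false -> {X, nodes_used(P, X)}       % X is the maximum depth
    end.

%% sub-master nodes on levels 0..X-2, plus P^X colony nodes on the last level
nodes_used(P, X) ->
    lists:sum([pow(P, L) || L <- lists:seq(0, X - 2)]) + pow(P, X).

pow(_, 0) -> 1;
pow(B, E) when E > 0 -> B * pow(B, E - 1).
```

With P = 5 and N = 150 this reproduces the worked example in the text: a 3-level tree using 131 of the 150 nodes.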

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.
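The fault-injection idea is simple to sketch in Erlang. The experiments above used the Chaos Monkey tool itself; the following is merely a minimal illustration of the mechanism (module name and interface are ours): repeatedly pick a random process from a candidate list and send it an unconditional exit signal.

```erlang
%% Minimal fault-injection sketch (illustrative; not the Chaos Monkey tool):
%% every IntervalMs, kill one random live process from Candidates, Kills times.
-module(tiny_chaos).
-export([run/3]).

run(_Candidates, _IntervalMs, 0) -> ok;
run(Candidates, IntervalMs, Kills) ->
    timer:sleep(IntervalMs),
    case [P || P <- Candidates, is_process_alive(P)] of
        []    -> ok;                                    % nothing left to kill
        Alive ->
            Victim = lists:nth(rand:uniform(length(Alive)), Alive),
            exit(Victim, kill)          % 'kill' cannot be trapped by the victim
    end,
    run(Candidates, IntervalMs, Kills - 1).
```

A supervised system such as GR-ACO or SR-ACO is expected to restart the killed processes and keep running; an unsupervised one, as noted above for ML-ACO, simply blocks.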

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

Figure 19: Mean Error (mean error (%) against number of colonies, 1 to 256)

3.2.4 Experimental Evaluation

A method commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions. It is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
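The quality metric plotted in Figure 19 can be stated precisely as the mean percentage gap between the costs found and the known optima. A minimal sketch of that calculation (module name is ours, for illustration):

```erlang
%% Sketch of the quality metric: mean percentage gap between the costs a run
%% produced and the known optimal costs, over a set of benchmark instances.
-module(aco_quality).
-export([mean_error/1]).

%% Pairs is a list of {FoundCost, OptimalCost} tuples, one per instance.
mean_error(Pairs) ->
    Gaps = [100 * (Found - Opt) / Opt || {Found, Opt} <- Pairs],
    lists:sum(Gaps) / length(Gaps).
```

For instance, one run 10% above optimum and one run exactly optimal give a mean error of 5%.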

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

Figure 20: Execution time (mean execution time (s) against number of colonies, 1 to 256)

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
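The determinism trick described above (swapping the random number generator for a fixed cyclic sequence) can be sketched as a small server process. This is our own illustration, not the benchmark's code:

```erlang
%% Sketch of the determinism trick: a process that hands out a fixed sequence
%% of numbers, cycling back to the start when the sequence is exhausted, in
%% place of a real random number generator.
-module(cyclic_rand).
-export([start/1, next/1]).

start(Seq) when Seq =/= [] ->
    spawn(fun() -> loop(Seq, Seq) end).

next(Gen) ->
    Gen ! {next, self()},
    receive {value, V} -> V end.

loop(All, []) -> loop(All, All);          % wrap around: cyclic, not random
loop(All, [H | T]) ->
    receive {next, From} -> From ! {value, H} end,
    loop(All, T).
```

Every run then makes exactly the same sequence of "random" choices, so repeated runs with the same input are directly comparable.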

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
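timer:tc/1 is a standard OTP function that runs a fun and returns the elapsed wall-clock time in microseconds together with the fun's result, for example:

```erlang
%% timer:tc/1 returns {ElapsedMicroseconds, Result}; wrapping the program's
%% main entry point this way excludes anything done before the call, such as
%% argument processing.
{Micros, Result} = timer:tc(fun() -> lists:sum(lists:seq(1, 1000)) end).
%% Result is 500500; Micros is the measured time in microseconds.
```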

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to

Figure 21: R15B execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version, and, as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,

Figure 24: TL-ACO execution times, Athos cluster (R15B, OTP 17.4 official, OTP 17.4 RELEASE)

Figure 25: ML-ACO execution times, Athos cluster (R15B, OTP 17.4 official, OTP 17.4 RELEASE)

Figure 26: GR-ACO execution times, Athos cluster (R15B, OTP 17.4 official, OTP 17.4 RELEASE)

Figure 27: R15B execution times, messages ×500 (TL-ACO, ML-ACO, GR-ACO)

Figure 28: OTP 17.4 execution times, messages ×500 (TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500 (TL-ACO, ML-ACO, GR-ACO, SR-ACO)

Figure 30: R15B execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,

Figure 31: OTP 17.4 execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/sub-master nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO, and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new-small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed at that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver and are hence more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurements to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores) and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
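Concretely, the two settings discussed above are passed on the erl command line when launching each VM. The following invocation is a sketch only: the node name and cookie are illustrative placeholders, and the rest of the benchmark's actual command line is omitted.

```shell
# Bind schedulers with the thread_no_node_processor_spread policy and set
# both the total and online scheduler counts to the 12 physical cores,
# ignoring the 12 additional hyperthreaded (virtual) cores.
erl +sbt tnnps +S 12:12 -name computing_node@athos-host -setcookie sim_diasca
```

The +S Schedulers:SchedulerOnline flag fixes the scheduler counts, and +sbt selects the scheduler binding type.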

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and

Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster

Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at this scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-plus node scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for Erlang applications, developed by RELEASE. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;

ICT-287510 (RELEASE) 23rd December 2015 46

(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way, we ended up with a file that was much smaller than the previous one (approximately 85MB), that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licence, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca, we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,          % Node family of the computing nodes
>     soda_benchmarking_test). % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.
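As an illustration only, the derivation of such a node name could look like the following sketch. The format is inferred from the naming example given earlier in this section and may not match Sim-Diasca's actual implementation.

```erlang
%% Illustrative sketch: derive the node name Sim-Diasca expects from the
%% simulation name, user and host. The format is inferred from the naming
%% example in the text, not taken from Sim-Diasca's sources.
-module(node_naming_sketch).
-export([expected_node_name/3]).

expected_node_name(SimulationName, User, Host) ->
    Capitalised = [capitalise(T) || T <- string:tokens(SimulationName, "_")],
    "Sim-Diasca_" ++ string:join(Capitalised, "_")
        ++ "-" ++ User ++ "@" ++ Host.

%% Upper-case the first character of a token.
capitalise([First | Rest]) -> [string:to_upper(First) | Rest].
```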

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,          % Node family of the user node
>     soda_benchmarking_test,  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
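A minimal sketch of how such a hierarchy could be set up is given below. It assumes SD Erlang's s_group:new_s_group/2 function and its {ok, Name, Nodes} return shape; the group names and the children_of/1 helper are hypothetical placeholders, since this design has not been implemented.

```erlang
%% Sketch only: builds a two-level s_group hierarchy of time manager nodes,
%% assuming SD Erlang's s_group:new_s_group/2. Names are hypothetical.
-module(tm_sgroup_sketch).
-export([setup_time_manager_groups/2, children_of/1]).

setup_time_manager_groups(RootNode, LocalManagerNodes) ->
    %% The root time manager and its direct children share one s_group...
    {ok, root_tms, _} =
        s_group:new_s_group(root_tms, [RootNode | LocalManagerNodes]),
    %% ...and each non-root time manager forms an s_group with its own
    %% children, acting as the gateway between the two groups.
    [begin
         GroupName = list_to_atom("tms_" ++ atom_to_list(Node)),
         {ok, GroupName, _} =
             s_group:new_s_group(GroupName, [Node | children_of(Node)])
     end || Node <- LocalManagerNodes],
    ok.

%% Placeholder: would return the child nodes of a given time manager node.
children_of(_Node) -> [].
```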


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to follow regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialised for either computation (these are the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialised nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK, multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, instead, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialised. To this end, we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a basename (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialisation. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialise the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialised.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
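A short usage sketch of the helper module on a compute node follows. The return values depend on the MPI job, so the comments describe the shape of the results rather than fixed output; the exact return value of startup/1 is an assumption.

```erlang
%% Illustrative session on each compute node of an MPI job.
mpi_helper:startup("mpinode"),    % node becomes e.g. 'mpinode3@hostA'
N = mpi_helper:get_world_size(),  % total number of Erlang nodes in the job
I = mpi_helper:get_index(),       % this node's unique MPI index
Others = mpi_helper:nodes(),      % the set of all other nodes
io:format("node ~p of ~p, peers: ~p~n", [I, N, Others]).
```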

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialised. It sets up the MPI environment and initialises the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initialises the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initialises a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialised, it broadcasts the names of all Erlang nodes on its first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
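On the Erlang side, a distribution-carrier module such as mpi_dist has to export the callback set that net_kernel expects, the same interface that the standard inet_tcp_dist module implements. The following skeleton illustrates that interface only; the stub bodies are placeholders, not the actual mpi_dist implementation.

```erlang
%% Skeleton of a *_dist distribution module: these are the callbacks that
%% net_kernel expects (the interface inet_tcp_dist implements). The bodies
%% below are stubs for illustration only.
-module(mpi_dist_skeleton).
-export([select/1, listen/1, accept/1, accept_connection/5,
         setup/5, close/1]).

select(_Node) -> true.                        % can we handle this node name?
listen(_Name) -> {error, not_implemented}.    % create the listening endpoint
accept(_Listen) -> spawn(fun() -> ok end).    % spawn the acceptor process
accept_connection(_Accept, _Socket, _MyNode, _Allowed, _SetupTime) ->
    spawn(fun() -> ok end).                   % handle an incoming connection
setup(_Node, _Type, _MyNode, _LongOrShort, _SetupTime) ->
    spawn(fun() -> ok end).                   % initiate an outgoing connection
close(_Socket) -> ok.
```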

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. (Execution time in seconds against number of ants; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)

Figure 51: Glasgow Xeon machines, large executions. (Execution time in seconds against number of ants; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. (Execution time in seconds against number of ants; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)


Figure 54: Heriot-Watt AMD machine, large executions. (Execution time in seconds against number of ants; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)

Change Log

Version  Date        Comments
0.1      31.01.2015  First version, submitted to internal reviewers
0.2      23.03.2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27.03.2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
  • Benchmarks
    • Orbit
      • Running Orbit on Athos
      • Distributed Erlang Orbit
      • SD Erlang Orbit
      • Experimental Evaluation
      • Results on Other Architectures
    • Ant Colony Optimisation (ACO)
      • ACO and SMTWTP
      • Multi-colony approaches
      • Evaluating Scalability
      • Experimental Evaluation
        • Performance comparison of different ACO and Erlang versions on the Athos cluster
          • Basic results
          • Increasing the number of messages
          • Some problematic results
          • Network Traffic
      • Summary
  • Measurements
    • Distributed Scalability
      • Performance
      • Distributed Performance Analysis
      • Discussion
    • BenchErl
    • Percept2
      • Experiments
  • Deploying Sim-Diasca with WombatOAM
    • The design of the implemented solution
    • Deployment steps
  • SD Erlang Integration
    • Implications and Future Work
  • Porting Erlang/OTP to the Blue Gene/Q
    • Basing Erlang/OTP's Distribution Mechanism on MPI
    • MPI Driver Internals
    • Current Status of the Blue Gene/Q Port
  • Single-machine ACO performance on various architectures and Erlang/OTP releases
    • Experimental parameters
    • Discussion of results
      • EDF Xeon machines
      • Glasgow Xeon machines
      • AMD machines
    • Discussion


initial loading could not be of interest for benchmarking purposes), and so that the latter could request settings updates (e.g. the requested number of schedulers for the computing nodes) to the former.

To allow for such an uncoupling, a plugin system has been implemented in the engine, and the Percept2 integration made use of it.

On that technical basis, measurements were performed. Results and findings will be discussed in Section 4.


3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables control of locality and reduces connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model nodes are grouped into a number of s_groups; nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global name space; instead, every s_group has its own namespace, which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found here: https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see http://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the ATHOS cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that it appears that the node names do not correspond exactly to the physical structure of the cluster: see [REL15, §4.4.4].
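Fragmented allocations like the one above are expressed in SLURM's bracketed nodelist syntax. The following is an illustrative parser (not SLURM's own code) that expands such an expression into individual host names:

```python
import re

def expand_nodelist(expr):
    """Expand a SLURM-style nodelist such as 'atcn[127-144,163-180]' into
    individual host names.  A toy parser for illustration, not SLURM's."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", expr)
    prefix, ranges = m.group(1), m.group(2)
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo                 # a bare number is a one-element range
        width = len(lo)               # preserve zero padding, e.g. atcn001
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{n:0{width}d}")
    return hosts

nodes = expand_nodelist("atcn[127-144,163-180]")
print(len(nodes), nodes[0], nodes[-1])  # 36 atcn127 atcn180
```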

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, …, gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers x1, …, xn ∈ [0, X]. The generator functions are applied


Figure 6: SLURM allocation

on the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14], which uses replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.
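The fixed-point computation described above can be sketched sequentially; the generators and the modulo operation below are toy assumptions for illustration, while the real benchmark distributes the vertex table across nodes.

```python
def orbit(generators, x0, space):
    """Sequential fixed-point orbit: repeatedly apply the generators to
    every known vertex until no new vertex in [0, space) appears."""
    seen = {x0}
    frontier = [x0]
    while frontier:
        x = frontier.pop()
        for g in generators:
            y = g(x) % space          # toy device to stay inside [0, space)
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen

# Toy generators -- illustrative assumptions, not the benchmark's.
g1 = lambda x: 2 * x + 1
g2 = lambda x: 3 * x
result = orbit([g1, g2], 1, 100)
print(len(result))
```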

In this section we describe how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either by putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or by executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here, N is the number of Athos hosts, c is the number of cores per node, t is the requested time limit (in minutes), and qos=release is the RELEASE project quota that enables us to request up to 256 Athos hosts.

To run the experiments we need to define the following parameters in the run-slurm script (Figure 7):

ICT-287510 (RELEASE) 23rd December 2015 15

FROMNUMNODES is the minimum number of nodes, on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to FROMNUMNODES=4, STEPNODES=3, NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
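The schedule of runs implied by these parameters can be sketched as follows (the variable names mirror the script's parameters; the enumeration itself is our illustration, not the script's code):

```python
# Node counts visited for FROMNUMNODES=4, STEPNODES=3 on a 10-node
# allocation, each repeated NUMREPEAT times.
FROMNUMNODES, STEPNODES, ALLOCATED, NUMREPEAT = 4, 3, 10, 2

runs = [(n, r)
        for n in range(FROMNUMNODES, ALLOCATED + 1, STEPNODES)
        for r in range(NUMREPEAT)]

print(sorted({n for n, _ in runs}))  # [4, 7, 10]
print(len(runs))                     # 6 runs in total
```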

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; then we run the experiment using the timetest script and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench:dist/4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find the part of the hash table in which this number should be stored.
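The routing of a generated number to the worker that owns its slice of the table can be illustrated as follows; the modulo scheme is a stand-in assumption, not the benchmark's actual hash function:

```python
# Each worker owns one slice of the distributed hash table; a generated
# vertex is routed by hashing it onto a worker index.
NUM_WORKERS = 8

def owner(x):
    """Index of the worker whose table slice stores vertex x."""
    return hash(x) % NUM_WORKERS

# The same number always routes to the same worker, so a duplicate check
# for a vertex never needs to consult more than one node.
print(owner(12345), owner(12345) == owner(12345))
```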

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process collects all of the credit, it can detect that the computation has finished.
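The core invariant of credit-based termination detection can be shown with a much-simplified sequential sketch (not the benchmark's implementation): the master starts with credit 1, spawning a task hands over half of the current credit, and a finishing task returns its remaining credit; the total is conserved exactly, so when the master has recovered credit 1, no work is in flight.

```python
from fractions import Fraction

def run(tasks_per_task, depth):
    """Toy credit/recovery scheme: every task halves its credit once per
    task it spawns, then returns what it kept to the master on finishing."""
    master = Fraction(0)
    pending = [(Fraction(1), depth)]   # (credit, remaining spawn depth)
    while pending:
        credit, d = pending.pop()
        if d > 0:
            for _ in range(tasks_per_task):
                credit /= 2                       # give half away...
                pending.append((credit, d - 1))   # ...to the new task
        master += credit           # task goes passive, returns its credit
    return master

print(run(3, 4))  # prints 1: all credit recovered, so computation terminated
```

Exact `Fraction` arithmetic is used so the conservation property holds without floating-point drift.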

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code

Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2×10^6, 3×10^6, 4×10^6, and 5×10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with the Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into a number of s_groups. Here we have two types of s_group: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.
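Plugging in illustrative numbers (N = 256 nodes and worker s_groups of M = 10 nodes, values chosen here for the example; the (N − 1)/M term approximates the number of sub-master nodes):

```python
# Per-node connection counts from the formulas above.
N, M = 256, 10

d_orbit_worker = N - 1                          # fully connected
sd_orbit_worker = M - 1                         # only its own s_group
sd_orbit_submaster = (M - 1) + (N - 1) / M      # s_group + master s_group

print(d_orbit_worker, sd_orbit_worker, round(sd_orbit_submaster, 1))
# 255 9 34.5 -- SD Erlang cuts a worker's connections by more than an
# order of magnitude at this scale.
```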

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No  Name      Location  Hosts  Cores/host  Total cores  Max cores  Wait time  Processor                  RAM   Distributed Erlang Port
1   GPG       GLA       20     16          320          320        0          Xeon E5-2640 v2, 2GHz      64GB  Yes
2   TinTin    Uppsala   160    16          2560         -          -          -                          -     Yes
3   Kalkyl    Uppsala   -      8           -            -          varies     -                          -     Yes
4   Athos     EDF       776    24          18624        6144       varies     Xeon E5-2697 v2, 2.7GHz    64GB  Yes
5   Zumbrota  EDF       4096   16          65536        -          17hrs      Blue Gene/Q (PowerPC A2)   -     No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation. Every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human intervention is then required to restart the hosts. The way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, an optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on two further clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4, Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found here: https://github.com/release-project/benchmarks/tree/master/ACO

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while the other elements are decreased; after this, a new generation of ants is started, the modifications to P serving to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.
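The generation loop just described can be sketched compactly for a toy SMTWTP instance; the cost function follows the standard weighted-tardiness definition, while the evaporation/reinforcement constants and the job data are illustrative assumptions, not the deliverable's parameters.

```python
import random

def tardiness(order, jobs):
    """Total weighted tardiness of a schedule; jobs are (length, due, weight)."""
    t, cost = 0, 0
    for i in order:
        length, due, weight = jobs[i]
        t += length
        cost += weight * max(0, t - due)
    return cost

def build_schedule(P, n, rng):
    """One ant: pick a job for each position, biased by pheromone P[i][pos]."""
    free, order = list(range(n)), []
    for pos in range(n):
        i = rng.choices(free, weights=[P[i][pos] for i in free])[0]
        free.remove(i)
        order.append(i)
    return order

def colony(jobs, ants=20, generations=30, seed=1):
    rng, n = random.Random(seed), len(jobs)
    P = [[1.0] * n for _ in range(n)]        # pheromone matrix
    best, best_cost = None, float("inf")
    for _ in range(generations):
        for s in (build_schedule(P, n, rng) for _ in range(ants)):
            c = tardiness(s, jobs)
            if c < best_cost:
                best, best_cost = s, c
        for row in P:                         # evaporation: decrease all
            for j in range(n):
                row[j] *= 0.9
        for pos, i in enumerate(best):        # reinforce the best schedule
            P[i][pos] += 1.0
    return best, best_cost

jobs = [(3, 4, 2), (2, 2, 1), (4, 10, 3), (1, 3, 1)]  # toy (length, due, weight)
order, cost = colony(jobs)
print(order, cost)
```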

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one. Once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO (diagram: a master process, colony nodes Node 1 … Node N_C, each running ant processes 1 … N_A)

their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes. In the next step, each colony process spawns NA ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and colonies are bidirectional. There are IM communications between the master process and a colony process; likewise, IA bidirectional communications take place between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO, the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying the following:


Figure 17: Node Placement in Multi-Level Distributed ACO (the master process at level 0, sub-master nodes at the intermediate levels, and only colony nodes at the last level, N)


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 nodes out of the 150 can be used.
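The relation above is easy to evaluate mechanically. The following sketch (function names are ours) computes the maximum number of levels and the number of usable nodes for given P and N, under our reading of the formula: sub-master nodes on levels 0 to X-2 plus P^X colony nodes on the last level.

```python
def usable_nodes(p, x):
    """Nodes in an x-level tree: 1 + p + ... + p^(x-2) sub-master nodes,
    plus p^x colony nodes on the last level."""
    return sum(p**i for i in range(x - 1)) + p**x

def max_levels(p, n):
    """Largest number of levels x whose tree still fits into n nodes."""
    x = 1
    while usable_nodes(p, x + 1) <= n:
        x += 1
    return x
```

With P = 5 and N = 150 this reproduces the worked example in the text: 3 levels and 131 usable nodes.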

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


Figure 19: Mean Error (mean error (%) plotted against the number of colonies, 1–256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken to find a solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
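The quality measure plotted in Figure 19 is simply the mean cost difference over the benchmark instances; a trivial sketch (the function name is ours):

```python
def mean_cost_error(costs, optima):
    """Mean difference between the costs of the solutions found and the
    known optimal costs, over a set of benchmark instances."""
    assert len(costs) == len(optima)
    return sum(c - o for c, o in zip(costs, optima)) / len(costs)
```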

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Execution time (mean execution time (s) plotted against the number of colonies, 1–256)

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
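The substitution of the random number generator can be pictured as follows (our own sketch; the actual replacement was done inside the Erlang code): a deterministic generator that cycles through a fixed sequence, so that every run takes identical decisions.

```python
class CyclicGenerator:
    """Deterministic stand-in for a random number generator: yields a
    fixed cycle of values, so repeated runs are exactly reproducible."""
    def __init__(self, cycle):
        self.cycle = list(cycle)
        self.i = 0

    def next(self):
        v = self.cycle[self.i]
        self.i = (self.i + 1) % len(self.cycle)
        return v
```

Two independent instances built from the same cycle produce identical streams, which is exactly the property needed to make repeated experimental runs comparable.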

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version; as with the results for the Orbit benchmark (see Section 3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from Section 3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see Section 3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in Section 3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 25: ML-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))


Figure 26: GR-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 27: R15B execution times, messages ×500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)


Figure 28: OTP 17.4 execution times, messages ×500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in Section 3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of Section 3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
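The degree of fragmentation can be quantified directly from the allocation strings. The sketch below (`parse_allocation` is our own helper; the comma placement in the two strings is our reconstruction of the extraction-damaged originals) counts the contiguous blocks and total hosts: both allocations cover 256 hosts, but the first splits them into 23 blocks against the second's 12.

```python
def parse_allocation(alloc):
    """Parse a SLURM host list such as 'atcn[141,144,181-184,...]' and
    return (number_of_contiguous_blocks, total_number_of_hosts)."""
    inner = alloc[alloc.index('[') + 1:alloc.rindex(']')]
    blocks = []
    for part in inner.split(','):
        lo, _, hi = part.partition('-')      # 'a-b' is a range, 'a' a single host
        blocks.append((int(lo), int(hi or lo)))
    total = sum(hi - lo + 1 for lo, hi in blocks)
    return len(blocks), total
```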

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO, and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4×1000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
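In terms of parallel efficiency, these numbers can be checked back-of-the-envelope as follows (a sketch using the figures quoted in the text; the 16-node runtime of roughly 290 minutes is inferred from the ~1000-minute baseline and the 3.45 speedup, and the function names are ours):

```python
def speedup(t_base, t_n):
    """Relative speedup against the single-node (16-core) baseline."""
    return t_base / t_n

def efficiency(s, node_factor):
    """Fraction of ideal speedup achieved on node_factor times the hardware."""
    return s / node_factor
```

A speedup of 3.45 on 16 times the hardware corresponds to roughly 22% parallel efficiency, which quantifies the under-utilisation discussed here.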

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 logical cores available) and 14% (8.96GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.
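For reference, the top(1) percentages convert to absolute figures as follows (a trivial sketch assuming 32 logical cores and 64GB of RAM per host, as above; the helper name is ours):

```python
def top_to_absolute(cpu_percent, mem_percent, ram_gb):
    """Convert top-style %CPU (where 100% == one logical core) and %MEM
    (percentage of total RAM) into (cores used, GB used)."""
    return cpu_percent / 100.0, ram_gb * mem_percent / 100.0
```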

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We used conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measured how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups at around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-plus-node scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), with eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with "tiny" scale and "brief" duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the "tiny" scale of City-simulation with duration "brief"


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the "small" scale of City-simulation with duration "brief"


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
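The plugin's timing logic amounts to a simple delayed start/stop window. A minimal sketch (with callbacks standing in for starting and stopping Percept2 on the computing nodes; the function name is ours):

```python
import threading
import time

def profile_window(start_delay, duration, start_profiling, stop_profiling):
    """After start_delay seconds, invoke start_profiling; duration seconds
    later, invoke stop_profiling (cf. the 10 s / 5 s window above).
    Returns the worker thread so the caller can join on it."""
    def run():
        time.sleep(start_delay)
        start_profiling()
        time.sleep(duration)
        stop_profiling()
    t = threading.Thread(target=run)
    t.start()
    return t
```

In the real plugin the two callbacks start and stop Percept2 on every computing node involved in the simulation, bounding the amount of trace data collected.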

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licenses, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
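As an illustration, a WombatOAM-oriented configuration along these lines might look as follows. The concrete syntax and values here are hypothetical; Sim-Diasca's actual settings format may differ.

```erlang
%% Hypothetical configuration fragment (illustrative syntax only):
{start_nodes, false}.          % the listed computing nodes are already running;
                               % the deployment manager skips deployment
{use_cookies, wombat_cookie}.  % use this fixed cookie (shared by all computing
                               % nodes) instead of a randomly generated one
```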

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of naming the nodes: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and will generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
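A minimal sketch of how this grouping might be expressed with the SD Erlang s_group API follows. The function, group names, and node lists are hypothetical; as noted above, this design has not been implemented or evaluated.

```erlang
%% Hypothetical sketch: a time manager node joins the s_group shared with
%% its parent and siblings, and creates a second s_group for its children.
%% s_group:new_s_group/2 is the SD Erlang call for creating an s_group.
create_tm_groups(ParentGroup, ParentAndSiblingNodes, TMNode, ChildNodes) ->
    %% s_group connecting this time manager with its parent and siblings
    {ok, ParentGroup, _} =
        s_group:new_s_group(ParentGroup, ParentAndSiblingNodes),
    %% s_group connecting this time manager with its children; the group
    %% name is derived from the node name for illustration only
    ChildGroup = list_to_atom(atom_to_list(TMNode) ++ "_children"),
    {ok, ChildGroup, _} =
        s_group:new_s_group(ChildGroup, [TMNode | ChildNodes]),
    {ok, ParentGroup, ChildGroup}.
```

Gateway processes registered within each s_group would then route inter-group messages, so that only the gateway nodes need transitive connections.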


Figure 44: The hierarchical structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (these are the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: these are the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users. Only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address of their associated I/O node; thus, the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi instead, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved to be desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
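Putting these together, a node in an MPI job might initialize and inspect the distributed system as follows. This is a hypothetical usage sketch of the mpihelper functions described above; in particular, the return value of startup/0 is an assumption.

```erlang
%% Hypothetical usage sketch of the mpihelper module (to be run on every
%% Erlang node of the MPI job).
init_distribution() ->
    %% builds the node name from the base name (mpinode by default), the
    %% MPI index and the hostname, starts net_kernel, and handshakes with
    %% all peer nodes
    ok = mpihelper:startup(),
    Peers = mpihelper:nodes(),          % all other Erlang nodes in the job
    Size  = mpihelper:get_world_size(), % total number of nodes
    Index = mpihelper:get_index(),      % this node's unique MPI index
    io:format("node ~p of ~p; peers: ~p~n", [Index, Size, Peers]).
```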

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. Execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).

Figure 48: EDF Xeon machines, small executions, with the erts +Muacul0 flag set. Execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).


Figure 49: EDF Xeon machines, large executions, with the erts +Muacul0 flag set. Execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. Execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), with one line each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000), with one line each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), with one line each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



3 Benchmarks

To improve the scalability of distributed Erlang we have designed and implemented Scalable Distributed Erlang (SD Erlang) [CLTG14], which enables the programmer to control locality and reduce connectivity. That is, SD Erlang offers an alternative connectivity model for distributed Erlang. In this model nodes are grouped into a number of s_groups; nodes have transitive connections with nodes from the same s_group, and non-transitive connections with other nodes. Moreover, SD Erlang provides group name registration as a scalable alternative to global name registration. In this model there is no global name space; instead, every s_group has its own namespace which is shared among the group members only.

In this section we investigate the scalability of two benchmarks, Orbit (Section 3.1) and ACO (Section 3.2), on large scale systems with up to 256 hosts (6144 cores). We compare the scalability of three versions of Erlang/OTP: Erlang/OTP R15B (Erl-R15B), Erlang/OTP 17.4 (Erl-17.4), and SD Erlang/OTP 17.4 (SDErl-17.4). Erl-R15B is the Erlang/OTP version that was released at the start of the RELEASE project, and is available from http://www.erlang.org/download_release/13. SDErl-17.4 is the SD Erlang version based on Erl-17.4 that was released at the end of the project, and can be found at https://github.com/release-project/otp/tree/17.4-rebased. We conclude this section by summarising the results of the experiments (Section 3.4).

The Athos cluster and SLURM. The benchmarks we present in this section are run on the Athos cluster located at EDF, France. Athos has 776 compute nodes, called atcn001–atcn776; each of these has 64GB of RAM and an Intel Xeon E5-2697 v2 processor with 12 cores and two hardware threads per core. In the RELEASE project we have simultaneous access to up to 256 nodes (6144 hardware threads) for up to 8 hours at a time.

Users interact with the cluster via a front-end node, and initially have no access to any of the compute nodes. Access to compute nodes is obtained via the SLURM workload manager (see https://slurm.schedmd.com), either interactively or via a batch script (see below) which specifies how many nodes are required, and for how long. Jobs wait in a queue until sufficient resources are available, and then SLURM allocates a number of compute nodes, which then become accessible via ssh. The user has exclusive access to these machines, and no-one else's code will be running at the same time. Fragmentation issues mean that jobs are not usually allocated a single contiguous block of machines, but rather some subset scattered across the cluster, for example atcn[127-144,163-180,217-288,487-504,537-648,667-684]. These will be interspersed with machines allocated to other users: see Figure 6, which shows a screenshot from SLURM's smap command at a time when the Athos cluster was fairly busy. The area at the top contains a string of characters, one for each machine in the cluster (wrapping round at the end of lines in the usual way). Dots represent unallocated machines, and coloured alphanumeric characters correspond to the jobs running on the machines; information about some of the jobs is shown in the lower part of the figure, with usernames and job names obscured. Note, for example, how the jobs labelled S and V are fragmented.

Users can request specific (and perhaps contiguous) node allocations, but it may take a long time before the desired nodes are all free at once, leading to a very long wait in the SLURM queue. A further complication is that it appears that the node names do not correspond exactly to the physical structure of the cluster: see [REL15, 4.4.4].

3.1 Orbit

Orbit is a symbolic computing kernel and a generalization of a transitive closure computation [LN01]. To compute the Orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new numbers x1, ..., xn ∈ [0, X]. The generator functions are applied


Figure 6: SLURM allocation

to the new numbers until no new number is generated.

The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs like Riak [Bas14] that use replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines long, and has good performance and extensibility.
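The orbit computation described above — repeatedly applying the generators to newly discovered vertices, recording them in a hash-partitioned table, until a fixpoint is reached — can be illustrated with a small sequential sketch. This is illustrative Python, not the project's Erlang code; the generators, space size and partition count are toy choices.

```python
def orbit(generators, x0, space, workers=4):
    """Compute the orbit of x0 under the generators within [0, space)."""
    tables = [set() for _ in range(workers)]   # one DHT partition per "worker"
    frontier = [x0 % space]
    while frontier:                            # fixpoint: stop when nothing new appears
        new = []
        for x in frontier:
            for g in generators:
                y = g(x) % space               # keep generated values inside the space
                part = hash(y) % workers       # partition that owns vertex y
                if y not in tables[part]:
                    tables[part].add(y)
                    new.append(y)              # only unseen vertices are explored further
        frontier = new
    return set().union(*tables)

# Toy instance: two affine generators on a space of 1000 elements.
result = orbit([lambda x: 3 * x + 1, lambda x: 5 * x + 2], 1, 1000)
```

In the distributed benchmark each partition lives on a worker process and newly generated numbers are sent to the owning partition; the sketch keeps the same ownership rule but runs in one thread.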

In this section we introduce how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case up to 60 nodes), and is mainly used to check whether the script works. Here, N is the number of Athos hosts, c is the number of cores per node, t is the requested time in minutes, and qos=release is the RELEASE project quota that enables us to request up to 256 Athos hosts.

To run the experiments we need to define the parameters in the run-slurm script (Figure 7).


FROMNUMNODES is the minimum number of nodes on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
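The node-count progression just described could be realised by a loop of the following shape. This is a hypothetical sketch, not the actual run-slurm script: the variable MAXNODES stands in for the allocated node count (SLURM_NNODES in a real job), and the benchmark invocation is reduced to an echo.

```shell
#!/bin/sh
# Hypothetical sketch of the scaling loop in a run-slurm-style script.
# FROMNUMNODES, STEPNODES and NUMREPEAT are the parameters of Figure 7.
FROMNUMNODES=4
STEPNODES=3
NUMREPEAT=2
MAXNODES=10

nodes=$FROMNUMNODES
while [ "$nodes" -le "$MAXNODES" ]; do
    rep=1
    while [ "$rep" -le "$NUMREPEAT" ]; do
        # here the real script would start the VMs, run timetest, and stop the VMs
        echo "repetition $rep on $nodes nodes"
        rep=$((rep + 1))
    done
    nodes=$((nodes + STEPNODES))
done
```

With the parameters shown, the loop visits 4, 7 and 10 nodes, running each configuration twice, matching the example above.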

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; we then run the experiment using the timetest script, and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer in comparison with experiments where we use the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the bench:dist4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table. A hash function is applied to a generated number to find in which part of the hash table this number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds back. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
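The credit/recovery idea can be illustrated with a small sequential sketch (illustrative Python, not the benchmark's Erlang implementation): the master starts with credit 1, each active task hands half of its remaining credit to every task it spawns, and returns whatever it still holds when it becomes passive. Termination is detected exactly when the recovered credit reaches 1.

```python
from fractions import Fraction

def run_with_credit(tasks):
    """Sketch of credit/recovery termination detection [MC98].

    `tasks` maps a task id to the ids of the tasks it spawns
    (absent or empty for leaf tasks)."""
    recovered = Fraction(0)
    pending = [("root", Fraction(1))]      # (task, credit it currently holds)
    while pending:
        task, credit = pending.pop()
        for child in tasks.get(task, []):
            credit /= 2                    # hand half of the remainder to the child
            pending.append((child, credit))
        recovered += credit                # task becomes passive: return the rest
    return recovered

# All credit is recovered exactly when every task has become passive.
assert run_with_credit({"root": ["a", "b"], "a": ["c"]}) == 1
```

Exact fractions avoid the floating-point underflow a long spawn chain would otherwise cause; the real algorithm addresses the same issue with its recovery mechanism.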

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/


Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code.

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2×10^6, 3×10^6, 4×10^6, and 5×10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with the Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4. We repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into a number of s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.
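These connection counts are easy to tabulate. The following sketch (illustrative Python; the function names are invented here, the formulas are those of the analysis above) compares the full mesh with an s_group partition:

```python
def full_mesh_connections(n):
    """Connections of a worker node in distributed Erlang Orbit: N - 1."""
    return n - 1

def sd_worker_connections(m):
    """Connections of a worker node whose s_group has M nodes: M - 1."""
    return m - 1

def sd_submaster_connections(n, m):
    """Connections of a sub-master node: the other members of its worker
    s_group plus the other sub-masters, (M - 1) + (N - 1)/M."""
    return (m - 1) + (n - 1) // m

# 256 hosts partitioned into s_groups of 16 nodes: each worker keeps
# 15 connections instead of 255, and a sub-master keeps 30.
assert full_mesh_connections(256) == 255
assert sd_worker_connections(16) == 15
assert sd_submaster_connections(256, 16) == 30
```

The quadratic growth of the full mesh (N(N−1)/2 connections in total) against the roughly linear growth of the partitioned network is what motivates the s_group design.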

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of the worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters we define in Section 3.1.2, for SD-Orbit we define the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we have chosen 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on the


No  Name      Location  Hosts  Cores/host  Total cores  Max cores  Wait time  Processor                 RAM   Distributed Erlang port
1   GPG       GLA       20     16          320          320        0          Xeon E5-2640 v2, 2GHz     –     Yes
2   TinTin    Uppsala   160    16          2560         –          –          –                         –     Yes
3   Kalkyl    Uppsala   –      8           –            –          varies     –                         –     Yes
4   Athos     EDF       776    24          18624        6144       varies     Xeon E5-2697 v2, 2.7GHz   64GB  Yes
5   Zumbrota  EDF       4096   16          65536        –          17hrs      BlueGene/Q (PowerPC A2)   –     No

Table 1: Machines Available for Benchmarking in the RELEASE Project

corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation. Every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading. However, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of the Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of the Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human involvement is then required to restart the hosts. The way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. The results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.


Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4 ((a) Runtime; (b) Speedup)


Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)


Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster ((a) Runtime; (b) Speedup)


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4: Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
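The generation loop described above can be sketched as follows. This is an illustrative Python sketch, not the project's Erlang code: the job data, the construction heuristic, and the simplified evaporate-then-deposit pheromone update are toy choices standing in for the rules of [BBHS99, dBSD00, MM00].

```python
import random

def tardiness(schedule, lengths, due, weights):
    """Total weighted tardiness of a schedule (a permutation of job ids)."""
    t, cost = 0, 0
    for j in schedule:
        t += lengths[j]
        cost += weights[j] * max(0, t - due[j])
    return cost

def build_schedule(P, n, rng):
    """One ant: pick the job for each position, guided by the pheromones."""
    unscheduled = list(range(n))
    schedule = []
    for pos in range(n):
        w = [P[j][pos] for j in unscheduled]       # desirability of job j at pos
        j = rng.choices(unscheduled, weights=w)[0]
        schedule.append(j)
        unscheduled.remove(j)
    return schedule

def aco(lengths, due, weights, ants=10, generations=20, rho=0.1, seed=1):
    n = len(lengths)
    rng = random.Random(seed)
    P = [[1.0] * n for _ in range(n)]              # pheromone matrix P[job][position]
    best, best_cost = None, float("inf")
    for _ in range(generations):
        for _ in range(ants):                      # one Erlang process per ant
            s = build_schedule(P, n, rng)
            c = tardiness(s, lengths, due, weights)
            if c < best_cost:
                best, best_cost = s, c
        for j in range(n):                         # evaporation decreases all entries
            for pos in range(n):
                P[j][pos] *= (1 - rho)
        for pos, j in enumerate(best):             # deposit along the best solution
            P[j][pos] += rho
    return best, best_cost

# Toy instance: four jobs with lengths, due dates, and weights.
best, cost = aco([3, 1, 2, 4], [2, 4, 3, 7], [2, 1, 3, 1])
assert sorted(best) == [0, 1, 2, 3]    # a valid permutation of the jobs
```

In the Erlang version the inner ant loop is the part that runs concurrently, with the master process alone performing the pheromone update between generations.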

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


[Diagram: a master process connected to colony nodes Node 1 … Node N_C, each colony running ant processes 1 … N_A.]

Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and is reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with N_C nodes. The master process spawns N_C colony processes on the available nodes. In the next step, each colony process spawns N_A ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional. There are I_M communications between the master process and a colony process; also, I_A bidirectional communications take place between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L + 1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X in the following inequality:


[Diagram: the master process at level 0, sub-master nodes at levels 1 to N−1, and colony nodes at level N.]

Figure 17: Node Placement in Multi-Level Distributed ACO


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150) and only 131 nodes out of 150 can be used.
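The relation above can be checked mechanically. The following sketch (Python, for illustration only, since the ACO implementations themselves are in Erlang) computes the maximum number of levels and the number of usable nodes:

```python
def aco_tree(p, n):
    """Return (levels, nodes_used) for the largest sub-master/colony tree
    fitting into n nodes, per the relation
    1 + p + p^2 + ... + p^(x-2) + p^x <= n."""
    def nodes_needed(x):
        # sub-master nodes on levels 0 .. x-2, plus p^x colony nodes
        return sum(p**i for i in range(x - 1)) + p**x
    x = 1
    while nodes_needed(x + 1) <= n:
        x += 1
    return x, nodes_needed(x)

print(aco_tree(5, 150))  # (3, 131): 3 levels, 131 of 150 nodes usable
```

This reproduces the worked example: with P = 5 and N = 150 the tree has 3 levels and uses 131 nodes.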

• Globally Reliable ACO (GR-ACO): This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO): This also adds fault-tolerance, but using the RELEASE project's s groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO, nodes are only connected to the nodes in their own s group.
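The motivation for restricting connections can be seen with some simple arithmetic. The sketch below (Python, illustrative; the group count and the neglect of inter-group gateway links are our assumptions, not figures from this deliverable) compares the default full mesh with an idealised s group partition:

```python
def full_mesh_links(n):
    # Default distributed Erlang: every node connects to every other node.
    return n * (n - 1) // 2

def s_group_links(n, g):
    # Idealised: n nodes split into g equal s_groups, each a full mesh
    # internally; the few inter-group (gateway) links are ignored here.
    size = n // g
    return g * size * (size - 1) // 2

print(full_mesh_links(256))    # 32640 connections in a 256-node full mesh
print(s_group_links(256, 16))  # 1920 connections with 16 s_groups of 16
```

Even this crude model shows an order-of-magnitude reduction in the number of TCP connections that must be maintained and monitored.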

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


Figure 19: Mean Error (mean error (%) against number of colonies, 1 to 256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run one's program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing their number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
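As a sketch of this evaluation method, the mean error plotted in Figure 19 can be computed as the mean percentage gap between obtained and optimal costs (Python, with toy numbers rather than the actual ORLIB data):

```python
def mean_error(costs, optima):
    """Mean percentage cost gap over a set of benchmark instances."""
    gaps = [100.0 * (c - o) / o for c, o in zip(costs, optima)]
    return sum(gaps) / len(gaps)

# e.g. three instances whose best-found costs exceed the optima by 10%, 0%, 5%
print(mean_error([110, 200, 210], [100, 200, 200]))  # 5.0
```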

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Execution time (mean execution time (s) against number of colonies, 1 to 256)

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
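The cyclic replacement for the random number generator can be sketched as follows (in Python for brevity; the actual change was made in the Erlang ACO code, and the sequence values here are illustrative):

```python
import itertools

def cyclic_uniform(seq):
    """Deterministic stand-in for a uniform(0,1) generator: cycles through
    a fixed list of values, so repeated runs are exactly reproducible."""
    it = itertools.cycle(seq)
    return lambda: next(it)

rand = cyclic_uniform([0.1, 0.5, 0.9])
print([rand() for _ in range(5)])  # [0.1, 0.5, 0.9, 0.1, 0.5]
```

Substituting such a generator removes run-to-run variation caused by random choices, so any remaining timing variation is attributable to the system rather than the algorithm.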

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23, we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems; TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version, and, as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 25: ML-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))


Figure 26: GR-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 27: R15B execution times, messages × 500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)


Figure 28: OTP 17.4 execution times, messages × 500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages × 500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141144181-184189-198235-286289-306325-347353-360363-366378387-396467-468541-549577-592595-598602611-648665667-684701-726729734-735771-776]

whereas the allocation for Figure 23 was

atcn[055-072109-144199-216235-252271-306325-342433-450458465-467505-522541-594667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and this means that at certain points, including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new 'small' scale of the City-example case, i.e. the second version of the 'small' scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
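The speedup and efficiency figures quoted here follow directly from the measured runtimes; a sketch of the arithmetic (Python, using the approximate single-node runtime above):

```python
def relative_speedup(t_base, t_n):
    """Speedup of an n-node run relative to the baseline configuration."""
    return t_base / t_n

def efficiency(speedup, node_factor):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup / node_factor

t1 = 1000.0          # minutes on 1 node (16 cores), approximate
t16 = t1 / 3.45      # 16-node runtime implied by the quoted speedup
print(round(relative_speedup(t1, t16), 2))    # 3.45
print(round(efficiency(3.45, 16) * 100, 1))   # 21.6 (% of ideal on 16 nodes)
```

An efficiency of roughly 22% on 16 nodes quantifies the claim that the distributed hardware is not being used effectively.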

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
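For reference, these scheduler settings would be passed on the erl command line roughly as follows (a sketch; the actual Sim-Diasca launch scripts set further options not shown here):

```shell
# Bind schedulers using the thread_no_node_processor_spread policy, and
# start 12 schedulers (12 online) to match the 12 physical cores,
# ignoring the hyperthreaded logical cores.
erl +sbt tnnps +S 12:12
```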

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 logical cores available) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered as a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The number of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows. To complement measurements made on the Glasgow GPG cluster, similar studies were performed

on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen of them has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, when the NUMA effects start becoming visible.
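One way to interpret this saturation is through Amdahl's law: if only a fraction p of the simulation is parallelisable, the achievable speedup is bounded no matter how many schedulers are added. This is an illustrative model only (the 80% figure below is an assumption, not a measurement from these experiments):

```python
def amdahl_speedup(p, s):
    """Predicted speedup on s schedulers if fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / s)

# With, say, 80% parallel work, 4 schedulers already give a large share of
# the attainable speedup (the upper bound is 1/(1-p) = 5).
print(round(amdahl_speedup(0.8, 4), 2))    # 2.5
print(round(amdahl_speedup(0.8, 64), 2))   # 4.71
```

Under such a model, going from 4 to 64 schedulers buys comparatively little, which is qualitatively consistent with the plateau seen in Figures 41 and 42 (before NUMA and hyperthreading effects make matters worse).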

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way, we end up with one file for each computing node that we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case

• information about process activity and runnability

• information about messages sent and received


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way, we ended up with a file that was much smaller than the previous one (approximately 85MB), that Percept2 could analyse, but that contained information for approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.

5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014, we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
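As a rough illustration, the two options might appear among the simulation settings as follows. This is a hypothetical sketch: only the option names start_nodes and use_cookies come from the text above; the value shapes and surrounding syntax are assumptions.

```erlang
%% Hypothetical sketch of Sim-Diasca simulation settings when the
%% computing nodes have already been deployed by WombatOAM.
%% Only the option names start_nodes and use_cookies are taken from
%% the text; the value shapes are assumed for illustration.
SimulationSettings = [
    %% Tell the deployment manager that the computing nodes passed
    %% to it are already running, so it must not deploy them itself:
    {start_nodes, false},

    %% Do not generate random Erlang cookies; use the fixed cookie
    %% that all computing nodes were started with:
    {use_cookies, 'simdiasca_cookie'}
].
```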

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
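The naming convention above could be sketched as follows. This is a hypothetical helper written for illustration, assuming the name format of the example; Sim-Diasca's actual naming code may differ in details such as separators.

```erlang
%% Hypothetical sketch of the node-naming convention described above;
%% not Sim-Diasca's actual code.
-module(node_naming).
-export([computing_node_name/3]).

%% Builds an atom such as
%% 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'
%% from the simulation name, the user name and the host.
computing_node_name(Simulation, User, Host) ->
    Camel = string:join([capitalize(Word)
                         || Word <- string:tokens(Simulation, "_")], "_"),
    list_to_atom("Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host).

%% Upper-case the first character of a word.
capitalize([First | Rest]) ->
    [string:to_upper(First) | Rest].
```

For example, computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1") yields 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.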

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology and the associated message routing between s_groups is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

More specifically, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
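The grouping could be sketched with SD Erlang's s_group API roughly as follows. This is a sketch only: the design was outlined but not implemented, and the node names and group names are hypothetical.

```erlang
%% Hypothetical sketch: each time manager node joins two s_groups,
%% one shared with its parent and siblings, one with its children.
%% Node and group names are illustrative; return values of
%% s_group:new_s_group/2 are not pattern-matched, as the exact
%% shape is not asserted here.
Root   = 'root_tm@host0',
Locals = ['tm1@host1', 'tm2@host2', 'tm3@host3'],

%% s_group containing the root and its direct children (the
%% parent-and-siblings group, from the children's point of view):
_ = s_group:new_s_group(tm_level_0, [Root | Locals]),

%% Each local time manager would likewise form an s_group with its
%% own children, e.g.:
_ = s_group:new_s_group(tm1_children,
                        ['tm1@host1', 'actor_node_a@host4']).
```

Nodes in an s_group share a namespace and full connections only within the group, so the all-to-all mesh of ordinary distributed Erlang is avoided; cross-group traffic would pass through the gateway processes mentioned above.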


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang virtual machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, instead one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, connect only to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement for TCP/IP as the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.



Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded with two processing units per core

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:



Figure 47: EDF Xeon machines, large executions


Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set



Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a



Figure 50: Glasgow Xeon machines, small executions


Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions


Figure 53 Heriot-Watt AMD machine small executions


[Plot: execution time (s) vs number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE)]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version | Date       | Comments
0.1     | 31/01/2015 | First version, submitted to internal reviewers
0.2     | 23/03/2015 | Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0     | 27/03/2015 | Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 6: SLURM allocation

on the new numbers until no new number is generated. The following features of Orbit make the benchmark a desirable case study for the RELEASE project:

• It uses a Distributed Hash Table (DHT), similar to NoSQL DBMSs such as Riak [Bas14], which uses replicated DHTs.

• It uses standard peer-to-peer (P2P) techniques and a credit/recovery distributed termination detection algorithm.

• It is only a few hundred lines of code, and has good performance and extensibility.

In this section we introduce how we run Orbit on the Athos cluster, then provide an overview of distributed Erlang Orbit (D-Orbit) and SD Erlang Orbit (SD-Orbit).

3.1.1 Running Orbit on Athos

We run the benchmark by calling the run-slurm script, either putting it in a queue, i.e.

$ sbatch -N256 -c24 -t300 --partition=comp --qos=release run-slurm

or executing the script immediately, i.e.

$ salloc -N10 -c24 -t30 run-slurm

The latter is used when the number of requested Athos hosts is small (in our case, up to 60 nodes), and is mainly used to check whether the script works. Here -N is the number of Athos hosts, -c is the number of cores per host, -t is the requested time limit (in minutes), and --qos=release is the RELEASE project quota that enables us to request up to 256 Athos hosts.

To run the experiments we need to define parameters in the run-slurm script (Figure 7).


FROMNUMNODES is the minimum number of nodes, on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in the subsequent runs.

NUMREPEAT is the number of times each experiment will run.

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
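The node-count schedule described above can be sketched as follows (a hypothetical helper, not the project's run-slurm script):

```python
# Hypothetical sketch (not the actual run-slurm script): given the three
# parameters described above, compute the node counts used in successive runs.
def node_counts(from_nodes: int, step: int, max_nodes: int) -> list:
    """Node counts for successive runs: from_nodes, from_nodes + step, ..."""
    return list(range(from_nodes, max_nodes + 1, step))

# The worked example from the text: 10 allocated hosts, starting at 4, step 3;
# each count would then be repeated NUMREPEAT (= 2) times, restarting the VMs.
assert node_counts(4, 3, 10) == [4, 7, 10]
```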

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; we then run the experiment using the timetest script and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results were inconsistent: sometimes the first run took significantly longer than the rest of the experiments, and sometimes the time per experiment increased with every run. So we decided to start and stop the VMs for every run, even though this takes longer than reusing the same VMs for all runs.

The module, function, and parameters which are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the benchdist4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table; a hash function is applied to a generated number to find in which part of the hash table the number should be stored.

To detect the termination of the Orbit computation, a credit/recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process collects all the credit, it can detect that the computation has finished.
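A minimal single-threaded sketch of this credit-based idea, in the spirit of [MC98]: the task structure, names, and credit-halving rule below are illustrative assumptions, not the Orbit implementation.

```python
# Illustrative sketch of credit-based termination detection: the master starts
# with credit 1; each spawned task carries half of its spawner's remaining
# credit; a task returns its credit when it becomes passive. The computation
# has terminated exactly when the full credit has been recovered.
from fractions import Fraction
from collections import deque

def run(tasks_spawned_by):
    """tasks_spawned_by maps a task name to the tasks it spawns (assumed acyclic)."""
    recovered = Fraction(0)
    queue = deque([(Fraction(1), "root")])      # (credit held, task)
    while queue:
        credit, task = queue.popleft()
        for child in tasks_spawned_by.get(task, []):
            credit /= 2                          # give half the remaining credit away
            queue.append((credit, child))
        recovered += credit                      # task is now passive: return credit
    return recovered                             # == 1 iff every task finished
```

For example, `run({"root": ["a", "b"], "a": ["c"]})` recovers the full credit of 1, signalling termination.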

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code.


Figure 9: D-Orbit Performance Depending on the Number of Worker Processes


Parameters. In the experiments discussed in Section 3.1.4 we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2×10^6, 3×10^6, 4×10^6, and 5×10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size of 2M elements, varying the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, and repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into s_groups. There are two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections, and a sub-master node has (M − 1 + (N − 1)/M) connections.
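The connection counts above can be checked with a small numerical sketch (helper names are illustrative):

```python
# Illustrative helpers for the connection counts stated above.
# D-Orbit: a fully connected cluster of N nodes gives every worker N - 1 links.
# SD-Orbit: with worker s_groups of M nodes, a worker keeps M - 1 links, and a
# sub-master keeps M - 1 links to its workers plus about (N - 1)/M links to the
# other sub-masters in the master s_group.
def d_orbit_worker_links(n: int) -> int:
    return n - 1

def sd_orbit_worker_links(m: int) -> int:
    return m - 1

def sd_orbit_submaster_links(n: int, m: int) -> float:
    return (m - 1) + (n - 1) / m

# e.g. 100 nodes in s_groups of 10: 99 links per worker drops to 9, while a
# sub-master holds about 9 + 9.9 = 18.9 links.
```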

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we define the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No | Name     | Location | Hosts | Cores/host | Total cores | Max cores | Wait time | Processor                | RAM  | Distributed Erlang port
1  | GPG      | GLA      | 20    | 16         | 320         | 320       | 0         | Xeon E5-2640 v2, 2GHz    | -    | Yes
2  | TinTin   | Uppsala  | 160   | 16         | 2560        | -         | -         | -                        | -    | Yes
3  | Kalkyl   | Uppsala  | -     | 8          | -           | -         | varies    | -                        | -    | Yes
4  | Athos    | EDF      | 776   | 24         | 18624       | 6144      | varies    | Xeon E5-2697 v2, 2.7GHz  | 64GB | Yes
5  | Zumbrota | EDF      | 4096  | 16         | 65536       | -         | 17hrs     | Blue Gene/Q (PowerPC A2) | -    | No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation, and every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and that after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which varies from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M elements, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human intervention is then required to restart them. Due to the way SLURM works, a user is not informed of the reasons for such failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put approximately 157 Athos hosts out of action and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4: Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while the other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
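The generation cycle described above can be sketched in Python (the actual implementation is in Erlang, with one process per ant and the pheromone matrix in an ETS table; the evaporation/deposit rule and parameter names below are simplified assumptions):

```python
# Illustrative sketch of one ACO generation: ants build schedules guided by a
# pheromone matrix, the best schedule is selected, and the matrix is updated
# so later ants favour previously profitable choices.
import random

def construct_solution(pheromone, rng):
    """One ant: pick a job for each position, weighted by pheromone[job][pos]."""
    n = len(pheromone)
    unscheduled, schedule = set(range(n)), []
    for pos in range(n):
        jobs = sorted(unscheduled)
        weights = [pheromone[job][pos] for job in jobs]
        job = rng.choices(jobs, weights=weights)[0]
        schedule.append(job)
        unscheduled.remove(job)
    return schedule

def generation(pheromone, cost, n_ants, rng, evaporation=0.1, deposit=1.0):
    ants = [construct_solution(pheromone, rng) for _ in range(n_ants)]
    best = min(ants, key=cost)                 # the master compares solutions
    n = len(pheromone)
    for i in range(n):                         # evaporate everywhere...
        for j in range(n):
            pheromone[i][j] *= 1 - evaporation
    for pos, job in enumerate(best):           # ...and reinforce the best schedule
        pheromone[job][pos] += deposit
    return best
```

Running `generation` repeatedly with a tardiness cost function plays the role of the master process restarting the ants each generation.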

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


[Diagram: a master process connected to colony processes on nodes Node 1 to Node NC, each colony spawning ant processes 1 to NA]

Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placement of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying:


[Diagram: the master process at level 0, sub-master nodes at levels 1 to N−1, and colony nodes (only) at level N]

Figure 17: Node Placement in Multi-Level Distributed ACO


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but uses the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
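The ML-ACO node-count relation above (1 + P + P^2 + ... + P^(X−2) + P^X ≤ N) can be checked numerically; the helper names below are illustrative, not part of the implementation:

```python
# Sketch of the sub-master tree sizing rule: with P processes per sub-master
# node, a tree of X levels needs 1 + P + P**2 + ... + P**(X-2) + P**X nodes
# (the last level holds only colony nodes). Given N available nodes, the
# deepest feasible tree is the largest X satisfying that bound.
def nodes_needed(p: int, x: int) -> int:
    return sum(p**k for k in range(x - 1)) + p**x

def max_levels(p: int, n: int) -> int:
    x = 1
    while nodes_needed(p, x + 1) <= n:
        x += 1
    return x

# The worked example from the text: P = 5 and N = 150 give a 3-level tree
# using 1 + 5 + 125 = 131 of the 150 nodes.
assert nodes_needed(5, 3) == 131
assert max_levels(5, 150) == 3
```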

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Plot: mean error (%) vs number of colonies (1–256)]

Figure 19: Mean Error

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken to find a solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


[Plot: mean execution time (s) vs number of colonies (1–256)]

Figure 20: Execution time

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
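The reproducibility trick described above (a "random" generator that returns a fixed cyclic sequence, so repeated runs make identical choices) can be sketched as follows; the class name and sequence values are illustrative assumptions:

```python
# Illustrative sketch: a deterministic stand-in for a random number generator,
# cycling through a fixed sequence so every run makes the same choices.
from itertools import cycle

class CyclicRandom:
    def __init__(self, seq):
        self._it = cycle(seq)

    def uniform(self):
        """Stand-in for a uniform [0, 1) draw."""
        return next(self._it)

rng = CyclicRandom([0.1, 0.5, 0.9])
draws = [rng.uniform() for _ in range(5)]
# The stream wraps around deterministically.
assert draws == [0.1, 0.5, 0.9, 0.1, 0.5]
```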

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
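Erlang's timer:tc wraps a function call and returns the elapsed wall-clock time in microseconds together with the call's result. A rough Python analogue of this measurement style (illustrative only; the deliverable's measurements used the Erlang function itself) is:

```python
import time

def tc(fun, *args):
    """Measure fun(*args), returning (elapsed_microseconds, result),
    in the style of Erlang's timer:tc."""
    start = time.perf_counter()
    result = fun(*args)
    elapsed_us = int((time.perf_counter() - start) * 1_000_000)
    return elapsed_us, result

micros, value = tc(sum, range(1000))
print(micros, value)
```

Note that, as in the text, such a wrapper measures only the wrapped call: any setup performed before it (argument processing, VM start-up) is excluded.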

3.3.1 Basic results

Figures 21-23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23, we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 22: OTP 17.4 execution times, Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO).


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO).

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24-26 show how the performance of each ACO version varies depending on the Erlang version, and, as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.
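The connection-count argument for s_groups above can be made concrete with a small sketch. This is a simplified illustration, not SR-ACO's exact topology: we assume nodes split into equal-sized s_groups, each fully meshed internally, with one gateway link per group back to a master:

```python
def full_mesh(n):
    """TCP connections in Erlang's default fully connected distribution."""
    return n * (n - 1) // 2

def s_group_mesh(n, groups):
    """Connections if n nodes are split into equal s_groups, each fully
    connected internally, plus one gateway link per group (a simplifying
    assumption for illustration)."""
    size = n // groups
    return groups * full_mesh(size) + groups

print(full_mesh(256))         # full mesh of 256 nodes: 32640 connections
print(s_group_mesh(256, 16))  # 16 s_groups of 16 nodes: 1936 connections
```

Even under this crude model, partitioning reduces the connection count by more than an order of magnitude, which is consistent with the reduced network traffic observed for SR-ACO in §3.3.4.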

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27-29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.
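The single-master bottleneck under message duplication can be illustrated with some simple arithmetic (the colony count and submaster fan-out below are hypothetical, chosen only to show the shape of the effect):

```python
def master_load(colonies, dup=1):
    """Messages arriving at the TL-ACO single master per global
    iteration: one result from each colony, times any duplication."""
    return colonies * dup

def submaster_load(colonies, degree, dup=1):
    """With an ML-ACO submaster tree of the given fan-out, no single
    process receives more than 'degree' results per iteration."""
    return min(colonies, degree) * dup

# Hypothetical figures: 250 colonies, every message sent 500 times.
print(master_load(250, dup=500))         # one process handles 125000 messages
print(submaster_load(250, 10, dup=500))  # any one process handles 5000
```

Multiplying every message by 500 multiplies the load at the TL-ACO master by the same factor, while the submaster hierarchy caps the per-process load, which matches the badly degraded TL-ACO curves in Figures 27-29.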

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21-23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (execution time (s) against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)).

Figure 25: ML-ACO execution times, Athos cluster (execution time (s) against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)).


Figure 26: GR-ACO execution times, Athos cluster (execution time (s) against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)).

Figure 27: R15B execution times, messages x 500 (execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO).


Figure 28: OTP 17.4 execution times, messages x 500 (execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500 (execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO).


Figure 30: R15B execution times (2), Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO).

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30-32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and this means that, at certain points, including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO).


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that, for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1,000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons, we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), we gain only 2.2 on 4 nodes (64 cores), and thereafter the scaling degrades to a maximum of just 3.45 on 16 nodes (256 cores).
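The relative speedup and parallel efficiency behind these figures can be computed with a short sketch. The 2-, 4- and 16-node runtimes below are back-computed from the reported speedups and the ~1,000-minute single-node baseline, so they are illustrative rather than measured values:

```python
def speedup(base_time, times):
    """Relative speedup and parallel efficiency with respect to a
    baseline runtime; times is a list of (node_count, runtime) pairs."""
    return [(base_time / t, (base_time / t) / n) for n, t in times]

# Runtimes in minutes, back-computed from the reported speedups.
runs = [(2, 1000 / 1.5), (4, 1000 / 2.2), (16, 1000 / 3.45)]
for (n, t), (s, e) in zip(runs, speedup(1000, runs)):
    print(f"{n:2d} nodes: speedup {s:.2f}, efficiency {e:.0%}")
```

The efficiency column makes the scaling problem explicit: utilisation of the added hardware drops from about 75% on 2 nodes to roughly 22% on 16 nodes.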

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 logical available cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond about a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we had added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

bull information about process activity and runnability

bull information about messages sent and received


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


bull information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014, we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes to Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM, or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution, Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes): Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution, Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load-testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the nodes; instead, they should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation must be passed to the script that starts a computing node. As an example of the naming scheme: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 by the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
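The naming scheme above can be captured in a small pure function. This is an illustrative sketch, not Sim-Diasca source: the module and function names are ours, and the capitalisation rule is inferred from the single documented example.

```erlang
%% Sketch (hypothetical module, not Sim-Diasca source): derive the
%% computing-node name that Sim-Diasca expects, following the pattern
%% shown above. The capitalisation rule is inferred from the example.
-module(node_naming).
-export([computing_node_name/3]).

computing_node_name(SimulationName, User, Host) ->
    %% "soda_benchmarking_test" -> "Soda_Benchmarking_Test"
    Camel = string:join([capitalize(T)
                         || T <- string:tokens(SimulationName, "_")], "_"),
    list_to_atom("Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host).

capitalize([C | Rest]) -> [string:to_upper(C) | Rest];
capitalize([]) -> [].
```

For the documented example, computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1") yields the node name 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.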

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,          % Node family of the computing nodes
>     soda_benchmarking_test). % Simulation name

The function also calculates the node names expected by Sim-Diasca, and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,            % Node family of the user node
>     soda_benchmarking_test,    % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a hook function, which means that the node could be started in the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

More specifically, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
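The membership rule can be made concrete with a small sketch: given the tree of time managers, each manager joins the s_group led by its parent and, if it is not a leaf, the s_group it leads itself. This is our illustration of the proposed design, not SD Erlang API code; the module name and group labels are hypothetical.

```erlang
%% Sketch (hypothetical module): compute the s_groups a time manager
%% would join under the proposed design. The tree is a map of
%% Parent => [Children]; group labels are illustrative only.
-module(tm_groups).
-export([s_groups_of/2]).

s_groups_of(Node, Tree) ->
    %% The group shared with the parent and the siblings:
    FromParent = [{group_led_by, P}
                  || {P, Children} <- maps:to_list(Tree),
                     lists:member(Node, Children)],
    %% The group containing this manager's own children, if any:
    OwnGroup = [{group_led_by, Node} || maps:is_key(Node, Tree)],
    FromParent ++ OwnGroup.
```

With a two-level tree such as #{root => [tm_a, tm_b], tm_a => [tm_a1, tm_a2]}, a mid-level manager like tm_a belongs to two groups, while the root and the leaves belong to one, matching the design above.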


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and of two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have been able to prepare for the removal of the next bottlenecks that will be encountered, and to promote some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from IBM's Blue Gene computer architecture series. It is divided into racks, in which nodes are specialised either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and for certain specialised nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address, that of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can access the TCP/IP stack at a time, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be exploited by a single Erlang Virtual Machine. As outlined above, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To allow distributed Erlang to use a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's TCP/IP-based distribution mechanism, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, one instead needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, only connects to nodes that we explicitly send messages to, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
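The name construction described above can be sketched as follows. The "@" separator between index and hostname is our assumption about how the parts are joined, and the module name is hypothetical (the real code lives in mpi_helper).

```erlang
%% Sketch (hypothetical module): the node-name construction performed by
%% mpi_helper:startup/1, i.e. basename ++ MPI index ++ hostname.
%% The "@" separator is an assumption.
-module(mpi_naming).
-export([node_name/3]).

node_name(BaseName, MpiIndex, HostName) ->
    list_to_atom(BaseName ++ integer_to_list(MpiIndex) ++ "@" ++ HostName).
```

With the default base name, MPI rank 3 on host host17 would be named mpinode3@host17; because every node derives its name from its own rank, the resulting names are unique without consulting epmd.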

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses two further modules, mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: Called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: Called for every new port opened; it only initializes the port's data structures.

stop: Called whenever a port is closed. This is currently not supported, as MPI connections do not need to be closed.

output: Called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the corresponding functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on that first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it waits to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port that is in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while the connection is being established.

• The send and receive functions respectively send and receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data available.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. This is currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any recent communication on the port, in particular to trigger tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions (execution time in seconds against number of ants: 1, 10, 20, 30, …, 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, which has 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000 ants.

• Large: 1, 500, 1000, 1500, …, 100000 ants.

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.
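The two sequences can be generated as follows (a sketch of our benchmark parameter generation; the module name is ours):

```erlang
%% Sketch: the ant counts used in the small and large experiments.
-module(ant_counts).
-export([small/0, large/0]).

small() -> [1 | lists:seq(10, 1000, 10)].     %% 1, 10, 20, ..., 1000
large() -> [1 | lists:seq(500, 100000, 500)]. %% 1, 500, 1000, ..., 100000
```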

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions (execution time in seconds against number of ants: 1, 500, 1000, 1500, …, 100000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set (execution time in seconds against number of ants: 1, 10, 20, 30, …, 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set (execution time in seconds against number of ants: 1, 500, 1000, 1500, …, 100000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the OTP 17 versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to restore the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results shown in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions (execution time in seconds against number of ants: 1, 10, 20, 30, …, 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))

Figure 51: Glasgow Xeon machines, large executions (execution time in seconds against number of ants: 1, 500, 1000, 1500, …, 100000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which means that only one processing unit per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version, in comparison with OTP 17.4 (on which it is based), on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using the single-machine version of the ACO program, which makes no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions (execution time in seconds against number of ants: 1, 10, 20, 30, …, 1000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))


Figure 54: Heriot-Watt AMD machine, large executions (execution time in seconds against number of ants: 1, 500, 1000, 1500, …, 100000; curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))

Change Log

Version 0.1 (31/01/2015): First version, submitted to internal reviewers.
Version 0.2 (23/03/2015): Revised version based on comments from all internal reviewers, submitted to the Commission Services.
Version 1.0 (27/03/2015): Final version, submitted to the Commission Services.

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Gunther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
• Distributed Erlang Orbit
• SD Erlang Orbit
• Experimental Evaluation
• Results on Other Architectures
• Ant Colony Optimisation (ACO)
  • ACO and SMTWTP
  • Multi-colony approaches
  • Evaluating Scalability
  • Experimental Evaluation
• Performance comparison of different ACO and Erlang versions on the Athos cluster
  • Basic results
  • Increasing the number of messages
  • Some problematic results
  • Network Traffic
• Summary
• Measurements
  • Distributed Scalability
    • Performance
    • Distributed Performance Analysis
  • Discussion
• BenchErl
• Percept2
  • Experiments
• Deploying Sim-Diasca with WombatOAM
  • The design of the implemented solution
  • Deployment steps
• SD Erlang Integration
  • Implications and Future Work
• Porting Erlang/OTP to the Blue Gene/Q
  • Basing Erlang/OTP's Distribution Mechanism on MPI
  • MPI Driver Internals
  • Current Status of the Blue Gene/Q Port
• Single-machine ACO performance on various architectures and Erlang/OTP releases
  • Experimental parameters
  • Discussion of results
    • EDF Xeon machines
    • Glasgow Xeon machines
    • AMD machines
  • Discussion

ICT-287510 (RELEASE) 23rd December 2015 15

FROMNUMNODES is the minimum number of nodes on which we run the experiment in the first run.

STEPNODES is the step by which we increase the number of nodes in subsequent runs.

NUMREPEAT is the number of times each experiment is repeated.

Figure 7: Parameters in run-slurm

Figure 8: Communication Model in Distributed Erlang Orbit

For example, if we request 10 nodes and set the parameters to $FROMNUMNODES=4, $STEPNODES=3, $NUMREPEAT=2, then the experiment will run on 4, 7, and 10 nodes, and every experiment will run twice.
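The schedule of node counts produced by these parameters can be sketched as follows. This is an illustrative Python sketch, not part of the run-slurm script itself; the function name is invented here, and the variable names simply mirror the parameters above.

```python
def node_schedule(from_num_nodes, step_nodes, max_nodes):
    """Return the node counts used in successive runs, mirroring the
    FROMNUMNODES / STEPNODES parameters of the run-slurm script."""
    return list(range(from_num_nodes, max_nodes + 1, step_nodes))

# With 10 allocated nodes, FROMNUMNODES=4 and STEPNODES=3,
# the experiment runs on 4, 7, and 10 nodes.
print(node_schedule(4, 3, 10))   # [4, 7, 10]
```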

For every run we start the defined number of Erlang VMs, which is equal to the number of Athos hosts, i.e. one Erlang node per Athos host; we then run the experiment using the timetest script and stop the VMs. We also tried to run the experiments without stopping the VMs between runs, but in this case the results are inconsistent: sometimes the first run takes significantly longer than the rest of the experiments, and sometimes the time per experiment increases with every run. So we decided to start and stop the VMs for every run, even though this takes longer than using the same VMs for all runs.

The module, function, and parameters that are called to run the experiments are defined in the timetest script. The Orbit parameters do not change in the experiments that we report in Section 3.1.4, so we define them in the bench_athos.erl module when calling the benchdist4 function.

3.1.2 Distributed Erlang Orbit

In the distributed Erlang Orbit all nodes are interconnected (Figure 8). The master process initiates the Orbit computation on all worker nodes, and each worker node has connections to all other worker nodes. Worker nodes communicate directly with each other and report results to the master node. Each worker process owns part of a distributed hash table; a hash function is applied to a generated number to determine in which part of the hash table the number should be stored.
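The ownership scheme can be sketched as follows. This is an illustrative Python sketch, not the Erlang implementation; the hash function shown is a stand-in for whatever hash the benchmark actually uses.

```python
NUM_WORKERS = 8   # one table fragment per worker process

def owner(x, num_workers=NUM_WORKERS):
    """Map a generated number to the worker owning the hash-table
    fragment in which it must be stored (stand-in hash function)."""
    return hash(x) % num_workers

# Every worker applies the same function, so all workers agree on
# where any number lives without consulting the master.
fragment = owner(42)
assert 0 <= fragment < NUM_WORKERS
```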

To detect the termination of the Orbit computation, a credit-recovery distributed algorithm is used [MC98]. Initially the master process has a specific amount of credit. Each active process holds a portion of the credit, and when a process becomes passive, i.e. inactive for a specific period of time, it sends the credit it holds to active processes. Therefore, when the master process has collected all the credit, it can detect that the computation has finished.
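A minimal sketch of the credit idea, in illustrative Python and assuming a simplified sequential model (the real algorithm [MC98] runs over asynchronous message passing, and the class and function names here are invented):

```python
from fractions import Fraction

class Master:
    def __init__(self):
        self.credit = Fraction(0)   # credit recovered so far

    def done(self):
        # The computation has terminated once all credit is back.
        return self.credit == 1

def spawn(n):
    """Split the master's initial credit of 1 among n active workers."""
    return [Fraction(1, n) for _ in range(n)]

master = Master()
workers = spawn(4)
# As each worker becomes passive, its share of the credit is returned.
for share in workers:
    assert not master.done()
    master.credit += share
assert master.done()
```

Using exact fractions avoids the rounding problems a floating-point credit would introduce when it is split repeatedly.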

The code, together with the SLURM scripts that we use to run D-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/d-orbit-code.

Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

Parameters. In the experiments we discuss in Section 3.1.4 we use the following parameters:

• The Orbit generator is bench:g12345/1.

• We run experiments for the following initial Orbit space sizes: 2 × 10^6, 3 × 10^6, 4 × 10^6, and 5 × 10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size of 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, repeating each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into s_groups. There are two types of s_group: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups; each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, worker nodes within an s_group communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication goes via the sub-master nodes. The number of connections of a worker node is thus equal to the number of worker nodes in its worker s_group.

Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.
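These connection counts can be tabulated with a small sketch (illustrative Python; N and M are the quantities defined above, and the example values of N = 161 and M = 10 are arbitrary):

```python
def d_orbit_worker_connections(n):
    # Fully connected: every node links to the other N-1 nodes.
    return n - 1

def sd_orbit_worker_connections(m):
    # Connections stay within the worker s_group of M nodes.
    return m - 1

def sd_orbit_submaster_connections(n, m):
    # M-1 nodes in its own s_group plus about (N-1)/M sub-masters.
    return (m - 1) + (n - 1) / m

# Example: 161 nodes split into s_groups of 10 nodes each.
print(d_orbit_worker_connections(161))          # 160
print(sd_orbit_worker_connections(10))          # 9
print(sd_orbit_submaster_connections(161, 10))  # 25.0
```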

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of the worker processes in its worker s_group, for collecting credit and data, and for forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found at https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. In addition to the parameters defined in Section 3.1.2, for SD-Orbit we define the following:

• Sub-master nodes are placed on separate Athos hosts from worker nodes.

• Each worker s_group contains one sub-master node and ten worker nodes.

To determine the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on

No  Name      Location  Hosts  Cores/host  Total cores  Max cores  Wait time  Processor                  RAM   Distributed Erlang Port
1   GPG       GLA       20     16          320          320        0          Xeon E5-2640 v2, 2 GHz     -     Yes
2   TinTin    Uppsala   160    16          2560         -          -          -                          -     Yes
3   Kalkyl    Uppsala   -      8           -            -          varies     -                          -     Yes
4   Athos     EDF       776    24          18624        6144       varies     Xeon E5-2697 v2, 2.7 GHz   64GB  Yes
5   Zumbrota  EDF       4096   16          65536        -          17 hrs     Blue Gene/Q (PowerPC A2)   -     No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding numbers of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading. However, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human involvement is then required to restart the hosts. Because of the way SLURM works, a user is not immediately informed of the reasons for such failures, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put approximately 157 Athos hosts out of action and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on two further clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with the results we observe on the Athos cluster.

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4 ((a) Runtime; (b) Speedup)

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime; (b) Speedup)

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster ((a) Runtime; (b) Speedup)

3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to Deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO discussed in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], in which a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while the other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.
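The generation loop can be sketched as follows. This is an illustrative Python sketch rather than the project's Erlang implementation: the cost function, parameter values, and function names are invented here, and the ants' construction heuristic is reduced to a seeded shuffle (the pheromone-biased construction is omitted for brevity).

```python
import random

def run_colony(costs, n_ants=10, generations=20, rho=0.1, seed=1):
    """Basic single-colony ACO sketch for a toy 'schedule N jobs'
    problem. P[i][j] = desirability of putting job i in position j."""
    n = len(costs)
    P = [[1.0] * n for _ in range(n)]
    rng = random.Random(seed)
    best, best_cost = None, float("inf")

    def cost(schedule):
        # Placeholder cost: weight each job's cost by its position.
        return sum(costs[job] * pos for pos, job in enumerate(schedule))

    for _ in range(generations):
        # Each ant builds a solution; here simply a seeded shuffle.
        solutions = []
        for _ in range(n_ants):
            jobs = list(range(n))
            rng.shuffle(jobs)
            solutions.append(jobs)
        gen_best = min(solutions, key=cost)
        if cost(gen_best) < best_cost:
            best, best_cost = gen_best, cost(gen_best)
        # Evaporate all entries, then reinforce the generation's best.
        for i in range(n):
            for j in range(n):
                P[i][j] *= (1 - rho)
        for pos, job in enumerate(gen_best):
            P[job][pos] += 1.0
    return best, best_cost

best, best_cost = run_colony([5, 3, 8, 1])
assert sorted(best) == [0, 1, 2, 3]   # a valid permutation of the jobs
```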

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report

Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with N_C nodes. The master process spawns N_C colony processes on the available nodes; in the next step, each colony process spawns N_A ant processes on its local node. In the figure, objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are I_M communications between the master process and a colony process, and I_A bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes we need to find a relation between the numbers of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying the inequality

Figure 17: Node Placement in Multi-Level Distributed ACO

Figure 18: Process Placement in Multi-Level ACO

1 + P + P^2 + · · · + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150) and only 131 of the 150 nodes can be used.
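The relation above can be checked with a short sketch (illustrative Python mirroring the inequality 1 + P + … + P^(X−2) + P^X ≤ N; the function names are invented here):

```python
def used_nodes(p, x):
    """Nodes in a tree of X levels: 1 + P + ... + P^(X-2) sub-master
    nodes plus P^X colony nodes."""
    return sum(p ** i for i in range(x - 1)) + p ** x

def max_levels(p, n):
    """Largest X such that used_nodes(p, X) <= N."""
    x = 1
    while used_nodes(p, x + 1) <= n:
        x += 1
    return x

# Example from the text: P = 5, N = 150.
print(max_levels(5, 150))   # 3
print(used_nodes(5, 3))     # 131 of the 150 nodes are used
```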

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are connected only to the nodes in their own s_group.
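The collection scheme shared by these multi-colony versions amounts to a tree reduction of the colonies' best solutions. A minimal sketch, in illustrative Python rather than Erlang, assuming solutions are (cost, data) pairs; the fanout and example values are arbitrary:

```python
def select_best(solutions):
    """What one sub-master does: keep the cheapest of its children's
    solutions, each represented as a (cost, data) pair."""
    return min(solutions, key=lambda s: s[0])

def tree_reduce(colony_results, fanout=5):
    """Feed results up through levels of sub-masters until one remains."""
    level = list(colony_results)
    while len(level) > 1:
        level = [select_best(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

results = [(cost, f"solution-{cost}") for cost in [42, 17, 99, 23, 58, 8, 71]]
print(tree_reduce(results))   # (8, 'solution-8')
```

The reduction yields the same answer as a single global minimum; the point of the tree is to spread the comparison work over sub-masters instead of one master.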

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature; see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.2.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase) solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed; for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however: the random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

Figure 19: Mean Error (mean error in % against the number of colonies, 1 to 256)

3.2.4 Experimental Evaluation

A method commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions; it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken to find a solution; we see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs that we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility we

9The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

Figure 20: Execution time (mean execution time in seconds against the number of colonies, 1 to 256)

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program; VMs were stopped and then restarted between executions.
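The de-randomisation described above can be sketched like this. This is an illustrative Python sketch; the real change was made inside the Erlang ACO code, and the cycle contents here are arbitrary.

```python
import itertools

def cyclic_generator(values):
    """A drop-in replacement for a random number source that returns
    the given values in a fixed repeating order, so repeated runs of
    the program make identical 'random' choices."""
    cycle = itertools.cycle(values)
    return lambda: next(cycle)

rand = cyclic_generator([0.1, 0.5, 0.9])
draws = [rand() for _ in range(5)]
print(draws)   # [0.1, 0.5, 0.9, 0.1, 0.5]
```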

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times reported here were measured by the ACO program itself, using Erlang's timer:tc function; they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than that of all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems; TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, in which every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to

Figure 21: R15B execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than the R15B VM does.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, whereas the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate phenomena similar to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,

Figure 24: TL-ACO execution times, Athos cluster (R15B; OTP 17.4 official; OTP 17.4 RELEASE)

Figure 25: ML-ACO execution times, Athos cluster (R15B; OTP 17.4 official; OTP 17.4 RELEASE)

Figure 26: GR-ACO execution times, Athos cluster (R15B; OTP 17.4 official; OTP 17.4 RELEASE)

Figure 27: R15B execution times, messages ×500 (TL-ACO, ML-ACO, GR-ACO)

Figure 28: OTP 17.4 execution times, messages ×500 (TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500 (TL-ACO, ML-ACO, GR-ACO, SR-ACO)

Figure 30: R15B execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon that has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster; at certain points, including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,

Figure 31: OTP 17.4 execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
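One way to quantify this fragmentation is to count the contiguous node ranges in a SLURM hostlist. The sketch below is illustrative Python (not part of our tooling); the two hostlists are the allocations quoted above, with the separating commas restored.

```python
def fragments(hostlist):
    """Number of contiguous node ranges in a SLURM hostlist such as
    'atcn[141,144,181-184,...]'."""
    inner = hostlist[hostlist.index("[") + 1:hostlist.rindex("]")]
    return len(inner.split(","))

busy = ("atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,"
        "363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,"
        "665,667-684,701-726,729,734-735,771-776]")
quiet = ("atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,"
         "458,465-467,505-522,541-594,667-684]")

# The busy-cluster allocation is split into far more fragments.
print(fragments(busy), fragments(quiet))   # 23 12
```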

Network traffic congestion Another issue is that irregularities in communication times becomemuch more apparent when there is a lot of communication occurring in the network Experimentsshow that when the network is congested communication between certain pairs of machines can taketen or more times longer than between other pairs This effect would combine very badly with thefragmentation effects mentioned above

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP 17.4 than Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new-small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver and are more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this improves only to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
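The arithmetic behind these figures can be sketched as follows; the implied runtimes are derived from the reported ~1000-minute single-node runtime and the reported speedups, not read directly from the measurements, so this is an illustration rather than data.

```python
single_node_runtime = 1000.0            # minutes, on 1 node (16 cores)
speedups = {2: 1.5, 4: 2.2, 16: 3.45}   # nodes -> speedup relative to 1 node

# Runtime implied by each reported speedup.
implied_runtime = {n: single_node_runtime / s for n, s in speedups.items()}

# Efficiency: fraction of the ideal n-fold speedup actually achieved.
efficiency = {n: s / n for n, s in speedups.items()}
```

At 16 nodes the efficiency is 3.45/16, i.e. about 22% of ideal scaling, which is why the runtime keeps falling while the hardware is used increasingly inefficiently.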

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we can usefully apply techniques such as SD Erlang s groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code; and

• most of the parameters that we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so that it would not capture information unrelated to the actual simulation, but rather, for example, to its setup) nor too late (so that it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks WombatOAM to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask WombatOAM to terminate the computing nodes. If the nodes were created on a cloud provider, terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load-testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned, but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
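The naming rule can be sketched as follows; the helper name is ours, and the exact convention is inferred from the single example above, so treat this as an approximation rather than the definitive Sim-Diasca rule:

```python
def computing_node_name(simulation, user, host):
    """Derive a Sim-Diasca computing node name from a simulation name,
    a user name and a host, following the pattern of the example above."""
    # soda_benchmarking_test -> Soda_Benchmarking_Test
    base = "_".join(word.capitalize() for word in simulation.split("_"))
    return "Sim-Diasca_{}-{}@{}".format(base, user, host)
```

For instance, `computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1")` yields `"Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1"`.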

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s groups. This s group connection topology, and the associated message routing between s groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s groups, i.e. the s group of its parent and siblings, and also an s group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s group would provide a gateway with processes that route messages to other s groups.
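The intended connectivity reduction can be illustrated with a back-of-the-envelope count; the tree shape and the clique-per-group assumption below are ours, for illustration only, not the Sim-Diasca design. In default distributed Erlang every node links to every other node, whereas in the hierarchy above each time manager only links within its parent's and its children's s groups.

```python
def full_mesh_connections(n):
    """Default distributed Erlang: every pair of nodes is connected."""
    return n * (n - 1) // 2

def sgroup_connections(fanout, depth):
    """Time managers form a complete tree; each internal manager forms one
    s group with its children (a clique of size fanout + 1), and node
    connections exist only inside s groups."""
    internal = sum(fanout ** d for d in range(depth))  # managers with children
    per_group = (fanout + 1) * fanout // 2             # clique edges per group
    return internal * per_group

total = sum(4 ** d for d in range(3))  # 21 managers: fanout 4, depth 2
```

Under these assumptions, 21 managers need 210 connections in a full mesh but only 50 in the grouped hierarchy, at the cost of routing inter-group messages through gateways.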


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to adopt regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialised for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialised nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of the Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, connect only to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI_index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
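As an illustration of the API just described, the following is a minimal sketch of our own (not taken from the deliverable's code base) of how a program might bring up MPI-based distribution and then fall back on ordinary distributed Erlang primitives; it assumes that mpihelper:startup/0 returns ok on success, which the deliverable does not specify.

```erlang
%% Illustrative sketch only: bringing up MPI-based distribution via the
%% mpihelper module described above. Assumes the node was started with
%%   erl -no_epmd -connect_all false -proto_dist mpi
%% and that mpihelper:startup/0 returns ok (an assumption on our part).
-module(mpi_example).
-export([main/0]).

main() ->
    %% Builds this node's name from the base name, MPI index and hostname,
    %% and initializes distribution via net_kernel.
    ok = mpihelper:startup(),
    Index = mpihelper:get_index(),      % unique MPI rank of this node
    Size  = mpihelper:get_world_size(), % total number of Erlang nodes
    Peers = mpihelper:nodes(),          % all other Erlang nodes
    io:format("node ~p of ~p, peers: ~p~n", [Index, Size, Peers]),
    %% From here on, ordinary distributed Erlang primitives work; the
    %% argument list is evaluated locally, so node() is the local node.
    [spawn(Peer, io, format, ["hello from ~p~n", [node()]]) || Peer <- Peers],
    ok.
```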

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses two further modules, mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes more data to be available.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, notably to cause tick messages to be sent when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.



Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000 ants

• Large: 1, 500, 1000, 1500, ..., 100000 ants

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:



Figure 47: EDF Xeon machines, large executions


Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set



Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a



Figure 50: Glasgow Xeon machines, small executions


Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which means that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions


Figure 53: Heriot-Watt AMD machine, small executions



Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version | Date | Comments
0.1 | 31/01/2015 | First version, submitted to internal reviewers
0.2 | 23/03/2015 | Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0 | 27/03/2015 | Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 9: D-Orbit Performance Depending on the Number of Worker Processes

scalability-measurements/Orbit/d-orbit-code

Parameters In the experiments we discuss in Section 3.1.4, we use the following parameters:

• The Orbit generator is benchg123451.

• We run experiments for the following initial Orbit space sizes: 2*10^6, 3*10^6, 4*10^6 and 5*10^6 elements.

To identify an optimal number of worker processes per worker node, we ran a set of experiments on a single node with an Orbit size equal to 2M elements, changing the number of worker processes as follows: 4, 8, 16, 24, 32, 48. We ran the experiments using Erl-R15B (Figure 9) and SDErl-17.4, and repeated each experiment 5 times. The results show that 8 worker processes per worker node provide the best performance for both versions of Erlang.

3.1.3 SD Erlang Orbit

In the SD Erlang version of Orbit we group nodes into s_groups. Here we have two types of s_groups: master and worker (Figure 10). There is only one master s_group, to which the master node and all sub-master nodes belong, and an arbitrary number of worker s_groups. Each worker s_group has only one sub-master node and a number of worker nodes.

Recall that in SD Erlang nodes have transitive connections with nodes from the same s_group, and non-transitive connections with the remaining nodes. Therefore, to reduce the total number of connections, within an s_group worker nodes communicate directly with each other, but when a worker node needs to communicate with a node outside its own s_group, the communication is done via the sub-master nodes. The number of connections of a worker node is equal to the number of worker nodes in its worker s_group.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in its worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections, and a sub-master node has (M − 1) + (N/M − 1) connections.
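As a worked example of these connection counts (the numbers are illustrative, chosen to match the one-sub-master-plus-ten-workers configuration used on Athos):

```latex
% Worked example: N = 110 nodes, split into s_groups of M = 11 nodes
% (one sub-master plus ten worker nodes per s_group).
\begin{align*}
\text{D-Orbit worker:}\quad    & N - 1 = 110 - 1 = 109 \text{ connections}\\
\text{SD-Orbit worker:}\quad   & M - 1 = 11 - 1 = 10 \text{ connections}\\
\text{SD-Orbit sub-master:}\quad
  & (M - 1) + \left(\tfrac{N}{M} - 1\right) = 10 + 9 = 19 \text{ connections}
\end{align*}
```

So each node maintains roughly an order of magnitude fewer connections than in the fully connected distributed Erlang case.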

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of the worker processes in its worker s_group, for collecting credit and data, and for forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code

Parameters On top of the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each sub-master s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we chose 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on


No | Name | Location | Hosts | Cores per host | Total cores | Max cores | Wait time | Processor | RAM | Distributed Erlang Port
1 | GPG | GLA | 20 | 16 | 320 | 320 | 0 | Xeon E5-2640 v2, 2GHz | | Yes
2 | TinTin | Uppsala | 160 | 16 | 2560 | - | | | | Yes
3 | Kalkyl | Uppsala | | 8 | | | varies | | | Yes
4 | Athos | EDF | 776 | 24 | 18624 | 6144 | varies | Xeon E5-2697 v2, 2.7GHz | 64GB | Yes
5 | Zumbrota | EDF | 4096 | 16 | 65536 | | 17hrs | Blue Gene/Q (PowerPC A2) | | No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation; every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which changes from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails due to the fact that some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human involvement is then required to restart the hosts. The way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend, we unknowingly put out of action approximately 157 Athos hosts, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran the Orbit experiments on two other clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with the results we observe on the Athos cluster.


(a) Runtime

(b) Speedup

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4


(a) Runtime

(b) Speedup

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4


(a) Runtime

(b) Speedup

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4: Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found here: https://github.com/release-project/benchmarks/tree/master/ACO

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
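The generation loop described above can be sketched as follows (a sketch, not the actual SMP-ACO code: `construct/2` and `update_pheromones/2` are hypothetical helpers, and each solution is assumed to be a `{Cost, Schedule}` pair):

```erlang
%% One generation of the single-colony ACO sketched above. The pheromone
%% matrix is an ETS table Tab with one tuple of N floats per row; the ant
%% processes only read it, and only the master (the caller) writes to it.
generation(Tab, NumAnts, N) ->
    Master = self(),
    %% Spawn one process per ant; each sends back a {Cost, Schedule} pair.
    [spawn(fun() -> Master ! {solution, construct(Tab, N)} end)
     || _ <- lists:seq(1, NumAnts)],
    Solutions = [receive {solution, Sol} -> Sol end
                 || _ <- lists:seq(1, NumAnts)],
    %% The best solution is the one with the lowest cost.
    Best = hd(lists:keysort(1, Solutions)),
    update_pheromones(Tab, Best),   % reinforce the best, decay the rest
    Best.
```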

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report

Figure 16: Two-Level Distributed ACO (the master process, and colony nodes Node 1 to Node N_C, each running ant processes 1 to N_A).

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO): There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with N_C nodes. The master process spawns N_C colony processes on the available nodes. In the next step, each colony process spawns N_A ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional. There are I_M communications between the master process and a colony process; likewise, I_A bidirectional communications take place between a colony process and an ant process.

• Multi-level ACO (ML-ACO): In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (the level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P, and the number of all available nodes is N, then the number of levels X is the maximum X satisfying

Figure 17: Node Placement in Multi-Level Distributed ACO (the master process at Level 0; sub-master nodes at Levels 1 to N−1; Level N contains only colony nodes).

Figure 18: Process Placement in Multi-Level ACO.

1 + P + P^2 + P^3 + … + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 nodes out of 150 can be used.
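The relation above can be checked mechanically; the following sketch (a hypothetical helper, not part of the ACO code) computes the maximum number of levels and the number of usable nodes:

```erlang
%% Total nodes needed for an X-level tree: the master and sub-master
%% levels contribute 1 + P + ... + P^(X-2) nodes, and the colony level
%% contributes P^X nodes (one node per colony process).
nodes_for(P, X) ->
    lists:sum([round(math:pow(P, I)) || I <- lists:seq(0, X - 2)])
        + round(math:pow(P, X)).

%% Maximum X such that nodes_for(P, X) =< N, with the node count used.
levels(P, N) -> levels(P, N, 2).
levels(P, N, X) ->
    case nodes_for(P, X + 1) =< N of
        true  -> levels(P, N, X + 1);
        false -> {X, nodes_for(P, X)}   % e.g. levels(5, 150) -> {3, 131}
    end.
```

With P = 5 and N = 150 this reproduces the worked example in the text: 3 levels using 131 of the 150 nodes.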

• Globally Reliable ACO (GR-ACO): This adds fault-tolerance, using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO): This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO, nodes are only connected to the nodes in their own s_group.
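The difference between GR-ACO and SR-ACO is, in essence, a switch of namespace API. A minimal sketch of the SR-ACO approach, assuming the s_group API described in [CLTG14] (new_s_group/2 and register_name/3; exact return values may differ):

```erlang
%% Put a sub-master and its colony nodes into their own s_group: nodes in
%% an s_group are transitively connected, and names registered in it are
%% visible, only within the group, avoiding a fully-connected cluster.
start_partition(GroupName, ColonyNodes) ->
    {ok, GroupName, _Nodes} =
        s_group:new_s_group(GroupName, [node() | ColonyNodes]),
    %% Group-local registration replaces global:register_name/2 (GR-ACO).
    yes = s_group:register_name(GroupName, submaster, self()).
```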

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.
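In essence, Chaos Monkey repeatedly does something like the following on each node (a simplified sketch, not the actual [Lun12] code):

```erlang
%% Periodically choose a random local process and kill it with an exit
%% signal; a reliable application must survive these induced failures.
chaos_loop(MaxDelayMs) ->
    timer:sleep(random:uniform(MaxDelayMs)),
    Victims = erlang:processes() -- [self()],
    Victim = lists:nth(random:uniform(length(Victims)), Victims),
    exit(Victim, chaos_monkey),   % exit signal, as an external kill
    chaos_loop(MaxDelayMs).
```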

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of a given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13], for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.2.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

Figure 19: Mean Error (mean error (%) against the number of colonies, from 1 to 256).

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, then run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken per solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
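The quantity plotted in Figure 19 can be expressed as a small sketch (our assumed formulation; the report does not give the exact formula):

```erlang
%% Mean relative error (%) of the costs found against the known ORLIB
%% optima, averaged over the 25 SMTWTP instances.
mean_error(CostPairs) ->
    Errors = [100 * (Cost - Opt) / Opt || {Cost, Opt} <- CostPairs],
    lists:sum(Errors) / length(Errors).
```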

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

Figure 20: Execution time (mean execution time (s) against the number of colonies, from 1 to 256).

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts, and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
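The determinism trick mentioned above can be sketched as a small server process (a hypothetical module, not the actual harness code):

```erlang
%% Serve a fixed sequence of "random" numbers cyclically, so that repeated
%% runs of the ACO program make exactly the same probabilistic choices.
-module(cyclic_rand).
-export([start/1, uniform/0, loop/2]).

start(Seq) when Seq =/= [] ->
    register(?MODULE, spawn(?MODULE, loop, [Seq, Seq])).

loop(Seq, []) -> loop(Seq, Seq);                  % wrap around: cyclic
loop(Seq, [X | Xs]) ->
    receive {get, From} -> From ! {rand, X}, loop(Seq, Xs) end.

uniform() ->                                      % drop-in for random:uniform/0
    ?MODULE ! {get, self()},
    receive {rand, X} -> X end.
```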

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to

Figure 21: R15B execution times, Athos cluster (execution time (s) against the number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 22: OTP 17.4 execution times, Athos cluster (execution time (s) against the number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time (s) against the number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO).

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,

Figure 24: TL-ACO execution times, Athos cluster (execution time (s) against the number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)).

Figure 25: ML-ACO execution times, Athos cluster (execution time (s) against the number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)).

Figure 26: GR-ACO execution times, Athos cluster (execution time (s) against the number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE)).

Figure 27: R15B execution times, messages ×500 (execution time (s) against the number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 28: OTP 17.4 execution times, messages ×500 (execution time (s) against the number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500 (execution time (s) against the number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO).

Figure 30: R15B execution times (2), Athos cluster (execution time (s) against the number of nodes, for TL-ACO, ML-ACO and GR-ACO).

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points, including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,

Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time (s) against the number of nodes, for TL-ACO, ML-ACO and GR-ACO).

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time (s) against the number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO).


and so would take longer to report its results to the master/sub-master nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4, whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host;

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO ((a) Number of Sent Packets; (b) Number of Received Packets).


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster.

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools, such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurements, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster.

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: that is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 logical available cores) and 14% (8.96 GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster.

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster.


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster.

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster.


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster.

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was to move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then to prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), each with eight cores (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps, and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, as the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
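The plugin logic just described might be sketched as follows. The callback names and the exact shape of the `percept2:profile` arguments are assumptions for illustration; they are not Sim-Diasca's or Percept2's literal APIs.

```erlang
%% Hedged sketch of the Percept2 start/stop plugin described in the text.
%% The callback names (on_simulation_start/1, on_simulation_stop/1) and the
%% percept2:profile argument shapes are illustrative assumptions.
-module(percept2_plugin).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Start Percept2 profiling on every computing node when the simulation
%% begins, writing one trace file per node.
on_simulation_start(ComputingNodes) ->
    [rpc:call(Node, percept2, profile,
              ["percept2_" ++ atom_to_list(Node) ++ ".dat",
               [concurrency, message]])
     || Node <- ComputingNodes],
    ok.

%% Stop Percept2 on all nodes when the simulation ends, leaving one
%% trace file per computing node for later offline analysis.
on_simulation_stop(ComputingNodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.
```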

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we expected that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca under Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;

ICT-287510 (RELEASE) 23rd December 2015 46

Figure 4.1: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief': (a) execution time; (b) speedup.


Figure 4.2: BenchErl results running the 'small' scale of City-simulation with duration 'brief': (a) execution time; (b) speedup.


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, then starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 4.3.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 4.3: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their respective general licenses, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes they want to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. Provisioning the virtual machine instances before each simulation and terminating them after each simulation is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries that make it behave in a way that is usable from WombatOAM. Namely, the start nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
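As an illustration, such configuration entries might look like the following. The key names and value shapes below are assumptions that mirror the description in the text; they are not guaranteed to match Sim-Diasca's literal configuration syntax.

```erlang
%% Hypothetical configuration sketch; key names are illustrative, not
%% Sim-Diasca's actual configuration keys.
DeploymentSettings = [
    %% Assume the computing nodes passed as a parameter are already
    %% running, so the deployment manager must not deploy them itself.
    {start_nodes, false},

    %% Do not generate random cookies on the user node; use the fixed
    %% cookie shared by all (already-deployed) computing nodes instead.
    {use_cookies, my_shared_cookie}
].
```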

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, the node should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be 'Sim-Diasca Soda Benchmarking Test-myuser@10.0.0.1'.

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,          % Node family of the computing nodes
>     soda_benchmarking_test). % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,            % Node family of the user node
>     soda_benchmarking_test,    % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 4.4, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 4.5 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
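A minimal sketch of this (unimplemented) design follows. It uses the SD Erlang `s_group:new_s_group/2` call for creating a group from a list of nodes; the group names, the return-value pattern, and the helper function are illustrative assumptions.

```erlang
%% Sketch only: partition the time manager tree into s_groups, as in the
%% design above. The s_group API usage is assumed; group names are
%% illustrative, and gateway processes are not shown.
-module(tm_partition).
-export([create_groups/2]).

%% RootNode hosts the root time manager; Subtrees is a list of
%% {ChildTMNode, GrandChildNodes} pairs describing the tree.
create_groups(RootNode, Subtrees) ->
    %% One s_group for the root time manager and its direct children.
    Children = [Child || {Child, _} <- Subtrees],
    {ok, _, _} = s_group:new_s_group(tm_root, [RootNode | Children]),
    %% One further s_group per child: that child plus its own children,
    %% so every time manager belongs to (at most) two s_groups.
    [begin
         Name = list_to_atom("tm_sub_" ++ atom_to_list(Child)),
         {ok, _, _} = s_group:new_s_group(Name, [Child | GrandChildren])
     end
     || {Child, GrandChildren} <- Subtrees],
    ok.
```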


Figure 4.4: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 4.5: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

By improving our knowledge of these applications and the scalability issues they experience, we prepared the ground for removing the next bottlenecks to be encountered, and identified design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user access the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
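Putting the helper functions together, bringing up distribution on a compute node might look like the following sketch. The mpihelper function names are as described above; the return value of startup/0 is an assumption.

```erlang
%% Sketch: bring up MPI-based distribution via the mpihelper module
%% described in the text, then inspect the MPI "world".
start_and_report() ->
    _ = mpihelper:startup(),            % builds mpinode<Index>@<host> and
                                        % initializes net_kernel (return
                                        % value assumed here)
    Peers = mpihelper:nodes(),          % all other Erlang nodes
    Size  = mpihelper:get_world_size(), % total number of nodes
    Index = mpihelper:get_index(),      % this node's MPI index
    io:format("node ~p of ~p, peers: ~p~n", [Index, Size, Peers]).
```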

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.



Figure 4.6: EDF Xeon machines, small executions

• Machines in EDF's ATHOS cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
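For reference, the two schedules of ant counts can be generated with lists:seq/3; the initial 1 is prepended since it is not part of either arithmetic progression:

```erlang
%% Ant counts used in the two experiment sizes.
Small = [1 | lists:seq(10, 1000, 10)],     % 1, 10, 20, 30, ..., 1000
Large = [1 | lists:seq(500, 100000, 500)]. % 1, 500, 1000, ..., 100000
```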

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases.

Figure 47: EDF Xeon machines, large executions. (Plot: execution time (s) against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. (Plot: execution time (s) against number of ants, 1–1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. (Plot: execution time (s) against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP-17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a

Figure 50: Glasgow Xeon machines, small executions. (Plot: execution time (s) against number of ants, 1–1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)

Figure 51: Glasgow Xeon machines, large executions. (Plot: execution time (s) against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.

Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions. (Plot: execution time (s) against number of ants, 1–1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)

Figure 54: Heriot-Watt AMD machine, large executions. (Plot: execution time (s) against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).)

Change Log

Version | Date       | Comments
0.1     | 31/01/2015 | First version, submitted to internal reviewers
0.2     | 23/03/2015 | Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0     | 27/03/2015 | Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 10: Communication Model in SD Erlang Orbit

The number of connections of a sub-master node is equal to the number of worker nodes in the worker s_group plus the number of sub-master nodes in the master s_group. That is, in a cluster with a total number of N nodes, a worker node in distributed Erlang Orbit has (N − 1) TCP connections, whereas in SD-Orbit, where each worker s_group has M nodes, a worker node has (M − 1) TCP connections and a sub-master node has (M − 1 + (N − 1)/M) connections.

An Orbit computation is started by the master process on the master node. The master process spawns two types of processes on every sub-master node: a sub-master process and gateway processes. A sub-master process is responsible for the initiation and termination of worker processes in its worker s_group, collecting credit and data, and forwarding the collected data to the master process. A gateway process forwards messages between worker nodes from different s_groups.
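The spawning just described can be sketched as follows. This is a schematic illustration with invented names, not the actual SD-Orbit code; the real sub-master logic is elided behind placeholders.

```erlang
%% Sketch (invented names, not the actual SD-Orbit code): the master sets up
%% each sub-master node with one sub-master process and a fixed number of
%% gateway processes.
start_submasters(SubMasterNodes, NumGateways) ->
    [setup_node(Node, NumGateways) || Node <- SubMasterNodes].

setup_node(Node, NumGateways) ->
    %% Gateway processes forward messages between worker nodes in
    %% different s_groups.
    Gateways = [spawn(Node, fun gateway_loop/0)
                || _ <- lists:seq(1, NumGateways)],
    %% The sub-master process starts/terminates the workers in its s_group,
    %% collects credit and data, and reports to the master.
    SubMaster = spawn(Node, fun() -> submaster_loop(Gateways) end),
    {Node, SubMaster, Gateways}.

gateway_loop() ->
    receive
        {forward, Dest, Msg} -> Dest ! Msg, gateway_loop()
    end.

submaster_loop(_Gateways) ->
    receive stop -> ok end.   %% placeholder for the real sub-master logic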

The code, together with the SLURM scripts that we use to run SD-Orbit on Athos, can be found here: https://github.com/release-project/RELEASE/tree/master/Research/Benchmarks/scalability-measurements/Orbit/sd-orbit-code.

Parameters. On top of the parameters defined in Section 3.1.2, for SD-Orbit we defined the following additional parameters:

• Sub-master nodes are on separate Athos hosts from worker nodes.

• Each sub-master s_group contains one sub-master node and ten worker nodes.

To define the number of gateway processes on sub-master nodes, we ran an experiment with 2 s_groups, varying the number of gateway processes as follows: 30, 40, 50. The results show that on this particular configuration the number of gateway processes does not have a significant impact on SD-Orbit performance, so we have chosen 40 gateway processes per sub-master node.

3.1.4 Experimental Evaluation

Figures 11(a) and 11(b) show the runtime and speedup of the D-Orbit and SD-Orbit implementations. The speedup is the ratio between the execution time on one node with one core and the execution time on

No | Name     | Location | Hosts | Cores per host | Total cores | Max cores used | Wait time | Processor                | RAM  | Distributed Erlang Port
1  | GPG      | GLA      | 20    | 16             | 320         | 320            | 0         | Xeon E5-2640 v2, 2GHz    | 64GB | Yes
2  | TinTin   | Uppsala  | 160   | 16             | 2560        | -              | -         | -                        | -    | Yes
3  | Kalkyl   | Uppsala  | -     | 8              | -           | -              | varies    | -                        | -    | Yes
4  | Athos    | EDF      | 776   | 24             | 18624       | 6144           | varies    | Xeon E5-2697 v2, 2.7GHz  | 64GB | Yes
5  | Zumbrota | EDF      | 4096  | 16             | 65536       | -              | 17hrs     | Blue Gene/Q (PowerPC A2) | -    | No

Table 1: Machines Available for Benchmarking in the RELEASE Project

the corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation. Every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 depending on the size of Orbit, which ranges from 2M to 5M elements. The results show that after reaching a peak, the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and, unlike D-Orbit, its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, and human involvement is then required to restart the hosts. Because of the way SLURM works, a user is not informed of the reasons for the failures immediately, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put approximately 157 Athos hosts out of action, and were informed of the issue only the following Monday. However, we did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, an optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This is true for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking to the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on the following two clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b). These results are consistent with the results we observe on the Athos cluster.

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4. ((a) Runtime; (b) Speedup.)

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4. ((a) Runtime; (b) Speedup.)

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4. ((a) Runtime; (b) Speedup.)

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4. ((a) Runtime; (b) Speedup.)

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster. ((a) Runtime; (b) Speedup.)

3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4, Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found here: https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e., we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e., has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
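The process structure just described can be sketched as follows. This is a schematic outline with invented names and placeholder ant logic (the real heuristic construction and pheromone update of SMP-ACO are elided), not the actual RELEASE code.

```erlang
%% Schematic outline of the SMP-ACO loop (invented names; placeholder
%% construct/2 and update_pheromones/2 stand in for the real heuristics).
-module(smp_aco_sketch).
-export([run/3]).

run(Jobs, NumAnts, NumGenerations) ->
    N = length(Jobs),
    %% Pheromone matrix P as an ETS table: one entry per row, each row an
    %% N-tuple of floats. Ants read P; only the master writes to it.
    P = ets:new(pheromone, [set, public, {read_concurrency, true}]),
    [ets:insert(P, {I, erlang:make_tuple(N, 1.0)}) || I <- lists:seq(1, N)],
    generations(Jobs, P, NumAnts, NumGenerations, none).

generations(_Jobs, _P, _NumAnts, 0, Best) ->
    Best;
generations(Jobs, P, NumAnts, G, Best0) ->
    Master = self(),
    %% One Erlang process per ant; each independently constructs a solution.
    [spawn_link(fun() -> Master ! {solution, construct(Jobs, P)} end)
     || _ <- lists:seq(1, NumAnts)],
    Solutions = [receive {solution, S} -> S end || _ <- lists:seq(1, NumAnts)],
    Best = best_of([Best0 | Solutions]),
    update_pheromones(P, Best),   %% reinforce Best's entries, decay the rest
    generations(Jobs, P, NumAnts, G - 1, Best).

best_of(Sols) ->
    hd(lists:sort([S || S <- Sols, S =/= none])).  %% lowest cost first

%% Placeholders: a real ant builds a schedule guided by P with heuristic
%% and random choices, and the master adjusts P towards the best schedule.
construct(Jobs, _P) -> {length(Jobs), Jobs}.
update_pheromones(_P, _Best) -> ok.
```

Solutions are represented here as {Cost, Schedule} pairs, so sorting by the first tuple element selects the lowest-cost one.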

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e., generations of ants) and then report

Figure 16: Two-Level Distributed ACO

their best solutions; the globally best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes. In the next step, each colony process spawns NA ant processes on the local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional. There are IM communications between the master process and a colony process; also, IA bidirectional communications take place between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (the level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying the inequality

Figure 17: Node Placement in Multi-Level Distributed ACO

Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + … + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e., 1 + 5 + 5^3 ≤ 150), and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are connected only to the nodes in their own s_group.
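As a concrete check of the inequality used above for sizing the ML-ACO tree, a small function (our naming, not part of the actual code) can compute the maximum number of levels for given P and N:

```erlang
%% Maximum X such that 1 + P + P^2 + ... + P^(X-2) + P^X =< N:
%% the sub-master tree plus the P^X colony nodes must fit in N nodes.
max_levels(P, N) -> max_levels(P, N, 2).   %% at least master + colonies

max_levels(P, N, X) ->
    case nodes_needed(P, X + 1) =< N of
        true  -> max_levels(P, N, X + 1);  %% a deeper tree still fits
        false -> X
    end.

nodes_needed(P, X) ->
    lists:sum([pow(P, L) || L <- lists:seq(0, X - 2)]) + pow(P, X).

pow(_, 0) -> 1;
pow(P, E) -> P * pow(P, E - 1).
```

For P = 5 and N = 150 this yields 3 levels, matching the example in the text (1 + 5 + 125 = 131 ≤ 150).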

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.
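The recovery behaviour can be sketched with plain process monitors. This is a simplified illustration with invented names: GR-ACO registers names with the global module and SR-ACO with s_groups, both of which this sketch omits.

```erlang
%% Sketch: a supervisor monitors its colony processes and respawns any
%% that die, e.g. when Chaos Monkey sends an exit signal.
%% SpawnColony is a zero-argument fun that starts one colony process.
watch_colonies(SpawnColony, NumColonies) ->
    Pids = [SpawnColony() || _ <- lists:seq(1, NumColonies)],
    [erlang:monitor(process, Pid) || Pid <- Pids],
    watch_loop(SpawnColony, Pids).

watch_loop(SpawnColony, Pids) ->
    receive
        {'DOWN', _Ref, process, Dead, _Reason} ->
            New = SpawnColony(),            %% restart the failed colony
            erlang:monitor(process, New),
            watch_loop(SpawnColony, [New | Pids -- [Dead]])
    end.
```

A real implementation would also re-register the restarted process under its old name so that the rest of the system can keep addressing it.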

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature; see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase) solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

Figure 19: Mean Error. (Plot: mean error (%) against number of colonies, 1–256.)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run your program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions; it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

Figure 20: Execution time. (Plot: mean execution time (s) against number of colonies, 1–256.)

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time in seconds against number of nodes, for TL-ACO, ML-ACO and GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (execution time in seconds against number of nodes, for TL-ACO, ML-ACO and GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time in seconds against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version, and as with the results for the Orbit benchmark (see §3.1.4) we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than the R15B VM does.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (execution time in seconds against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE))

Figure 25: ML-ACO execution times, Athos cluster (execution time in seconds against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE))


Figure 26: GR-ACO execution times, Athos cluster (execution time in seconds against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE))

Figure 27: R15B execution times, messages ×500 (execution time in seconds against number of nodes, for TL-ACO, ML-ACO and GR-ACO)


Figure 28: OTP 17.4 execution times, messages ×500 (execution time in seconds against number of nodes, for TL-ACO, ML-ACO and GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500 (execution time in seconds against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (execution time in seconds against number of nodes, for TL-ACO, ML-ACO and GR-ACO)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time in seconds against number of nodes, for TL-ACO, ML-ACO and GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time in seconds against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration at all.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the opposite holds (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO. (a) Number of Sent Packets; (b) Number of Received Packets


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.
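As an illustration of this kind of resource sampling (a sketch under the assumption of a standard Linux /proc layout, not the project's actual measurement scripts), per-host memory pressure and packet counters can be read directly from /proc rather than screen-scraping top and netstat:

```shell
#!/bin/sh
# Sketch: sample memory usage and per-interface packet counters on a
# Linux host. Field positions follow the standard /proc file formats.

mem_used_percent() {
    # Percentage of MemTotal currently in use, from /proc/meminfo.
    awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
         END { printf "%.1f", 100 * (t - a) / t }' /proc/meminfo
}

packet_counts() {
    # "<iface> <rx_packets> <tx_packets>" per interface, from /proc/net/dev.
    awk -F: 'NR > 2 {
        iface = $1; gsub(/ /, "", iface)
        split($2, f, " ")
        print iface, f[2], f[10]
    }' /proc/net/dev
}

mem_used_percent; echo
packet_counts
```

Sampling such counters periodically while a run is in progress yields curves of the kind shown later in Figures 37–39.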

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver and are hence more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurements to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this reaches only 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
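For reference, the relative speedup quoted above is simply the baseline (single-node) runtime divided by the n-node runtime; feeding in the approximate figures from the text (about 1000 minutes on one node, about 290 minutes on sixteen) reproduces the 3.45:

```shell
#!/bin/sh
# Relative speedup: baseline runtime divided by runtime on n nodes.
# The 1000 and 290 minute inputs are the approximate figures quoted
# in the text, not exact measurements.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f", t1 / tn }'
}

speedup 1000 290; echo
```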

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here; that is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
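Concretely, these two settings are passed as Erlang VM flags: +sbt tnnps selects the binding policy, and +S limits the number of schedulers started and brought online. A small helper sketches the flag construction (the erl invocation shown in the comment is a placeholder, not a command taken from the deliverable):

```shell
#!/bin/sh
# Build the scheduler flags discussed above: bind schedulers with the
# thread_no_node_processor_spread policy, and start/bring online only as
# many schedulers as there are physical cores (ignoring hyperthreads).
scheduler_flags() {
    physical=$1
    printf '+sbt tnnps +S %s:%s' "$physical" "$physical"
}

scheduler_flags 12; echo
# usage sketch (placeholder node name and arguments):
#   erl $(scheduler_flags 12) -name sim@host ...
```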

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of packets sent and received between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond about a dozen has little effect at this scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale, up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was to move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we had added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration using only one computing node. The reason we selected this particular setup was that we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability

• information about messages sent and received


Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'. (a) Execution time; (b) Speedup


Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'. (a) Execution time; (b) Speedup


• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the user to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their respective general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the nodes, but should instead start them by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have certain names, and the name of the simulation must be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
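
The naming rule just described can be sketched as follows. This is an illustrative C helper under our assumptions, not part of Sim-Diasca or WombatOAM: it capitalises each underscore-separated word of the simulation name, prefixes Sim-Diasca_, and appends the user and host.

```c
#include <ctype.h>
#include <stdio.h>

/* Illustrative sketch (hypothetical helper, mirroring the naming rule in
 * the text): derive the computing-node name from the simulation name,
 * user and host, e.g. soda_benchmarking_test + myuser + 10.0.0.1
 * gives Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1 */
void sim_diasca_node_name(char *out, size_t len, const char *sim,
                          const char *user, const char *host) {
    char cap[128];
    size_t i;
    int start_of_word = 1;  /* capitalise the first letter of each word */
    for (i = 0; sim[i] != '\0' && i + 1 < sizeof cap; i++) {
        cap[i] = start_of_word ? (char)toupper((unsigned char)sim[i]) : sim[i];
        start_of_word = (sim[i] == '_');  /* next word starts after '_' */
    }
    cap[i] = '\0';
    snprintf(out, len, "Sim-Diasca_%s-%s@%s", cap, user, host);
}
```

The exact capitalisation scheme is inferred from the single example above; the real Sim-Diasca start script may differ in details.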

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a function used as a deployment hook, which means that the node could be started in the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and of two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
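
The spin-loop workaround can be illustrated with a small C sketch (ours, not the actual Erlang/OTP patch): a read() wrapper that retries transient conditions instead of ever entering a blocking state.

```c
#include <errno.h>
#include <unistd.h>

/* Illustrative sketch, not the actual Erlang/OTP patch: wrap read() in a
 * spin loop so the call never parks the caller in the kernel, since CNK
 * may deadlock on blocking I/O. The fd is assumed to be non-blocking. */
ssize_t spin_read(int fd, void *buf, size_t count) {
    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0)
            return n;  /* data read, or 0 at end of file */
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1; /* a real error: report it */
        /* transient condition: spin and retry */
    }
}
```

A production version would presumably yield or back off between retries; the point here is only that control never blocks inside the kernel.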

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer of the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling net_kernel:start([mynodename, shortnames]) (or longnames, respectively) from Erlang.

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes that we explicitly send messages to, and activates the mpi_dist Erlang module as the network driver. Afterwards, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
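
For illustration, the name construction could look like this in C (a sketch under our assumptions; the exact concatenation in mpi_helper may differ):

```c
#include <stdio.h>

/* Illustrative sketch (not the actual mpi_helper code): build a unique,
 * predictable Erlang node name of the form <basename><rank>@<hostname>,
 * using the MPI rank in place of an epmd-arbitrated port mapping. */
void mpi_node_name(char *out, size_t len,
                   const char *base, int rank, const char *host) {
    snprintf(out, len, "%s%d@%s", base, rank, host);
}
```

Because every process knows its own MPI rank, each node can compute its own name, and every other node's name, without any central registry.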

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
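
The command-byte protocol of the output callback can be sketched as follows. This is hypothetical C for illustration only: the real driver callback receives an ErlDrvData handle, and its handlers perform actual MPI communication rather than returning labels.

```c
#include <stddef.h>

/* Illustrative sketch (not the actual driver source): the output callback
 * of an Erlang port driver receives a buffer whose first byte selects the
 * operation; the remainder of the buffer is relayed to the handler.
 * The numeric command values here are invented for the example. */
enum cmd { CMD_LISTEN = 1, CMD_ACCEPT, CMD_CONNECT, CMD_SEND, CMD_RECEIVE };

const char *dispatch_output(const char *buf, size_t len) {
    if (len == 0) return "error";        /* no command byte present */
    switch ((enum cmd)buf[0]) {          /* first byte = command */
    case CMD_LISTEN:  return "listen";   /* enter listening mode */
    case CMD_ACCEPT:  return "accept";   /* wait to be contacted */
    case CMD_CONNECT: return "connect";  /* connect to an acceptor */
    case CMD_SEND:    return "send";     /* payload follows in buf+1 */
    case CMD_RECEIVE: return "receive";
    default:          return "error";
    }
}
```

The control callback follows the same pattern with its own command set (statistics, mode switches, tick messages).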

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Figure 46: EDF Xeon machines, small executions. The plot shows mean execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Figure 47: EDF Xeon machines, large executions. The plot shows mean execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 48: EDF Xeon machines, small executions, with the erts +Muacul0 flag set. The plot shows mean execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


[Figure 49: EDF Xeon machines, large executions, with the erts +Muacul0 flag set. The plot shows mean execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Figure 50: Glasgow Xeon machines, small executions. The plot shows mean execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 51: Glasgow Xeon machines, large executions. The plot shows mean execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Figure 53: Heriot-Watt AMD machine, small executions. The plot shows mean execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


[Figure 54: Heriot-Watt AMD machine, large executions. The plot shows mean execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Change Log

Version  Date        Comments

0.1      31/01/2015  First version, submitted to internal reviewers

0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

ICT-287510 (RELEASE) 23rd December 2015 18

No | Name     | Location | Hosts | Cores per host | Total cores | Max cores | Wait time | Processor                 | RAM  | Distributed Erlang Port
1  | GPG      | GLA      | 20    | 16             | 320         | 320       | 0         | Xeon E5-2640 v2, 2GHz     | –    | Yes
2  | TinTin   | Uppsala  | 160   | 16             | 2560        | –         | –         | –                         | –    | Yes
3  | Kalkyl   | Uppsala  | –     | 8              | –           | –         | varies    | –                         | –    | Yes
4  | Athos    | EDF      | 776   | 24             | 18624       | 6144      | varies    | Xeon E5-2697 v2, 2.7GHz   | 64GB | Yes
5  | Zumbrota | EDF      | 4096  | 16             | 65536       | –         | 17hrs     | Blue Gene/Q (PowerPC A2)  | –    | No

Table 1: Machines Available for Benchmarking in the RELEASE Project

corresponding number of nodes and cores. In the experiments we use Erl-R15B and SDErl-17.4. For each of the experiments we plot the standard deviation, and every experiment was repeated seven times. The results show that D-Orbit scales identically in Erl-R15B and SDErl-17.4, and after 40 nodes the performance starts degrading; however, performance in Erl-R15B is better than in SDErl-17.4. SD-Orbit scales worse than D-Orbit on a small number of nodes, but as the number of nodes grows SD-Orbit performs better (beyond 80 nodes), and its performance does not degrade as the number of nodes grows.

Figures 12(a) and 12(b) depict D-Orbit performance in SDErl-17.4 as the size of Orbit changes from 2M to 5M elements. The results show that after reaching a peak the performance starts to degrade as the number of nodes continues to grow. This trend is not observed in the corresponding SD-Orbit experiments (Figures 13(a) and 13(b)). We show D-Orbit and SD-Orbit performance side by side for 2M and 5M elements in Figures 14(a) and 14(b). Again, SD-Orbit scales better as the number of nodes grows, and unlike D-Orbit its performance does not deteriorate.

When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. This kind of failure causes Athos hosts to go down, after which human intervention is required to restart them. Because of the way SLURM works, a user is not informed immediately of the reasons for such failures, so when we ran D-Orbit experiments of size 12M over a weekend we unknowingly put approximately 157 Athos hosts out of action, and were informed of the issue only the following Monday. We did not experience this problem when running SD-Orbit experiments, even of size 60M.

We also observed that, independently of the Orbit size, the optimal number of worker processes per worker node for both D-Orbit and SD-Orbit is 8. This holds for both the Erl-R15B and SDErl-17.4 versions of Erlang.

3.1.5 Results on Other Architectures

Table 1 presents information about the machines available for benchmarking in the RELEASE project. Apart from the Athos cluster, we ran Orbit experiments on two further clusters: GPG and Kalkyl. Results of running Orbit on the Kalkyl cluster are presented in Figures 15(a) and 15(b); these results are consistent with those we observe on the Athos cluster.

Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4; (a) Runtime, (b) Speedup

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4; (a) Runtime, (b) Speedup

Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4; (a) Runtime, (b) Speedup

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4; (a) Runtime, (b) Speedup

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster; (a) Runtime, (b) Speedup


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO discussed in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number indicating the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.
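The generation loop just described can be sketched as follows. This is an illustrative Python sketch, not the project's Erlang code; `construct` and `cost` are placeholder callbacks, and the decay/boost constants are assumed values.

```python
import random

def aco_generation(pher, n_ants, construct, cost, rho=0.1, boost=1.0):
    """One ACO generation: ants build solutions guided by the pheromone
    matrix `pher` (entry [i][j] = desirability of job i in position j);
    the best (lowest-cost) solution reinforces `pher`, all entries decay."""
    solutions = [construct(pher) for _ in range(n_ants)]
    best = min(solutions, key=cost)           # lowest cost wins
    n = len(pher)
    for i in range(n):                        # decay every entry
        for j in range(n):
            pher[i][j] *= (1.0 - rho)
    for pos, job in enumerate(best):          # reinforce the best schedule
        pher[job][pos] += boost
    return best
```

Iterating `aco_generation` until a stopping condition holds (a fixed generation count, or no improvement for some number of generations) gives the single-colony algorithm.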

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report

Figure 16: Two-Level Distributed ACO (master process; colony nodes Node 1 … Node NC, each running ant processes 1 … NA)

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows.

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placement of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communication between the master process and the colonies is bidirectional: there are IM communications between the master process and each colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level above the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying

Figure 17: Node Placement in Multi-Level Distributed ACO (master process at level 0, sub-master nodes at intermediate levels, colony nodes only at level N)

Figure 18: Process Placement in Multi-Level ACO

1 + P + P^2 + P^3 + … + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 of the 150 nodes can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
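The ML-ACO structure can be illustrated with a small sketch (Python, purely illustrative; the deliverable's implementation is in Erlang). `best_of_tree` mimics the sub-master reduction, and `tree_shape` computes the number of levels X and usable nodes from the relation 1 + P + P^2 + … + P^(X-2) + P^X ≤ N.

```python
def best_of_tree(solutions, cost, fanout):
    """ML-ACO-style reduction: each sub-master keeps the best of its
    `fanout` children's solutions; repeat until one global best remains."""
    level = list(solutions)
    while len(level) > 1:
        level = [min(level[i:i + fanout], key=cost)
                 for i in range(0, len(level), fanout)]
    return level[0]

def tree_shape(p, n_hosts):
    """Deepest sub-master tree that fits on n_hosts nodes, with p
    processes per sub-master node. A tree of X levels occupies
    1 + p + p^2 + ... + p^(X-2) + p^X nodes (the last level holds
    the p^X colony nodes)."""
    def nodes(x):
        return sum(p ** l for l in range(x - 1)) + p ** x
    x = 1
    while nodes(x + 1) <= n_hosts:
        x += 1
    return x, nodes(x)
```

For P = 5 and N = 150 this gives X = 3 levels and 131 usable nodes, matching the worked example above.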

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase) solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution, and fail to ever arrive at the global optimum and terminate.

Figure 19: Mean Error (%) vs. Number of Colonies (1–256)

3.2.4 Experimental Evaluation

A method commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal ones; it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken to find a solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

Figure 20: Mean Execution Time (s) vs. Number of Colonies (1–256)

removed non-determinacy by replacing the random number generator with a function which returns a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
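The de-randomisation step can be mimicked as below (a Python sketch; the actual replacement was made inside the Erlang ACO code). The "generator" simply cycles through a fixed sequence, so repeated runs make identical choices.

```python
import itertools

def make_cyclic_rand(seq):
    """Deterministic stand-in for a random number generator: returns
    successive elements of `seq`, cycling forever."""
    it = itertools.cycle(seq)
    return lambda: next(it)
```

Swapping in such a generator makes runs reproducible while still perturbing solutions in the same spirit as the original randomness.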

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to

Figure 21: R15B execution times (TL-ACO, ML-ACO, GR-ACO), Athos cluster

Figure 22: OTP 17.4 execution times (TL-ACO, ML-ACO, GR-ACO), Athos cluster

Figure 23: OTP 17.4 (RELEASE version) execution times (TL-ACO, ML-ACO, GR-ACO, SR-ACO), Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version, and as with the results for the Orbit benchmark (see §3.1.4) we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than the R15B VM does.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,

Figure 24: TL-ACO execution times (R15B, OTP 17.4 official, OTP 17.4 RELEASE), Athos cluster

Figure 25: ML-ACO execution times (R15B, OTP 17.4 official, OTP 17.4 RELEASE), Athos cluster

Figure 26: GR-ACO execution times (R15B, OTP 17.4 official, OTP 17.4 RELEASE), Athos cluster

Figure 27: R15B execution times, messages × 500

Figure 28: OTP 17.4 execution times, messages × 500

Figure 29: OTP 17.4 (RELEASE version) execution times, messages × 500

Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon that has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,

Figure 31: OTP 17.4 execution times (2), Athos cluster

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.
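Counters like those plotted in Figure 33 can be sampled from Linux per-interface statistics. The following is a generic illustrative parser for the `/proc/net/dev` format, not the project's measurement script:

```python
def parse_net_dev(text, dev):
    """Extract cumulative (rx_packets, tx_packets) for interface `dev`
    from the contents of Linux's /proc/net/dev."""
    for line in text.splitlines():
        if line.strip().startswith(dev + ":"):
            fields = line.split(":", 1)[1].split()
            # field 1 = packets received, field 9 = packets transmitted
            return int(fields[1]), int(fields[9])
    raise ValueError("interface %s not found" % dev)
```

Sampling these counters before and after a run, then subtracting, gives per-run totals of sent and received packets.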

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.

Figure 33: Network Traffic in ML-ACO, GR-ACO, and SR-ACO; (a) Number of Sent Packets, (b) Number of Received Packets

Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a

Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), efficiency falls away thereafter: the speedup is only 2.2 on 4 nodes (64 cores), degrading to a maximum of just 3.45 on 16 nodes (256 cores).
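The inefficiency behind these figures can be made explicit with a back-of-the-envelope check, using the relative speedups from Figure 35 (1.5, 2.2, and 3.45):

```python
def efficiency(speedup, n_nodes):
    """Parallel efficiency: achieved speedup divided by the ideal
    (linear) speedup for n_nodes nodes."""
    return speedup / n_nodes

# Speedups relative to a single 16-core node, as read from Figure 35:
reported = {2: 1.5, 4: 2.2, 16: 3.45}
efficiencies = {n: efficiency(s, n) for n, s in reported.items()}
# 2 nodes: 75%, 4 nodes: 55%, 16 nodes: about 22%
```

The steady decline in efficiency is what motivates the distributed performance analysis below.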

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread binding policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 of the 32 available logical cores) and 14% (8.96 GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets, between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and

ICT-287510 (RELEASE) 23rd December 2015 42

Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and in the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a benchmark suite developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca is executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores of the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was unrelated to the actual simulation and concerned, for example, its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
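A hedged sketch of such a plugin follows. The callback names mirror the shape of Sim-Diasca's plugin API but are illustrative, as are the Percept2 entry points used (percept2:profile/2 to start tracing, percept2:stop_profile/0 to end it); the actual APIs should be consulted before use.

```erlang
%% Illustrative plugin: starts Percept2 on every computing node when the
%% simulation starts, and stops it everywhere when the simulation ends.
-module(percept2_profiling_plugin).
-export([on_simulation_start/1, on_simulation_stop/1]).

on_simulation_start(ComputingNodes) ->
    %% One trace file per computing node, named after the node.
    [rpc:call(Node, percept2, profile,
              [atom_to_list(Node) ++ ".percept2",
               [concurrency, message]])      % assumed option names
     || Node <- ComputingNodes],
    ok.

on_simulation_stop(ComputingNodes) ->
    %% Stop tracing everywhere, leaving one file per node to analyse later.
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.
```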

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability,

• information about messages sent and received,


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency,

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances are runnable but not actually run by the Erlang VM with as much parallelism as possible, or whether, for example, a few complex models lead to some instances slowing down the whole simulation, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

Two WombatOAM features implemented since then make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to make small changes in Sim-Diasca, add a Sim-Diasca controller module to WombatOAM, and change the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, so that they do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
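In a simulation settings file this could look roughly as follows; this is a sketch paraphrased from the description above, and the exact entry names and file syntax of Sim-Diasca's configuration may differ:

```erlang
%% Run against computing nodes that WombatOAM has already deployed:
{start_nodes, false}.
    %% do not deploy the listed computing nodes; assume they are running

{use_cookies, wombat_deployed_cookie}.
    %% connect using this fixed cookie (the one shared by all computing
    %% nodes) instead of generating a random cookie on the user node
```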

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca Soda Benchmarking Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca-WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
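The grouping could be expressed with the SD Erlang API roughly as follows. This is a sketch under the assumption of the s_group:new_s_group/2 function and its return value as described in the RELEASE deliverables; the group naming scheme and the module are illustrative.

```erlang
%% Illustrative sketch: a time manager node joins the s_group of its parent
%% and siblings (created by its parent) and creates an s_group for its own
%% children, so each node connects only to its neighbours in the tree.
-module(tm_grouping).
-export([create_child_group/2]).

create_child_group(ManagerNode, ChildNodes) ->
    %% Hypothetical naming scheme: one group per (non-leaf) time manager.
    GroupName = list_to_atom("tm_" ++ atom_to_list(ManagerNode)),
    %% Assumed API: creates a group containing exactly these nodes; only
    %% nodes sharing an s_group become transitively connected.
    {ok, GroupName, _Nodes} =
        s_group:new_s_group(GroupName, [ManagerNode | ChildNodes]),
    GroupName.
```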


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialised for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialised nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the IP address of their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi instead, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes to which we explicitly send messages, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialised. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPIindex ++ "@" ++ hostname, which is passed to net_kernel for initialisation. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialise the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialised.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
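A hedged usage sketch of this helper, assuming the functions behave as described above (the module and its output format are illustrative):

```erlang
%% Bring up distributed Erlang over the MPI back-end on each node, then
%% report this node's place in the system.
-module(mpi_demo).
-export([start/0]).

start() ->
    mpihelper:startup(),                 % uses the default base name
    Index = mpihelper:get_index(),       % unique MPI index of this node
    World = mpihelper:get_world_size(),  % total number of Erlang nodes
    Peers = mpihelper:nodes(),           % all other nodes, now connected
    io:format("node ~p of ~p, peers: ~p~n", [Index, World, Peers]).
```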

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialised. It sets up the MPI environment and initialises the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initialises the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.

ICT-287510 (RELEASE) 23rd December 2015 58

• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. This is currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to trigger tick messages when there has been no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
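As an illustration of the command-byte protocol described above, the Erlang side might drive the port as follows; the command byte values here are invented for the sketch and are not the driver's actual encoding.

```erlang
%% Illustrative sketch only: in command mode the payload starts with a
%% command byte that the driver's output function parses, followed by the
%% command's arguments. The byte values below are assumptions.
-define(CMD_LISTEN,  1).
-define(CMD_CONNECT, 3).

listen(Port) ->
    true = erlang:port_command(Port, [?CMD_LISTEN]).

connect(Port, NodeName) when is_atom(NodeName) ->
    true = erlang:port_command(Port, [?CMD_CONNECT, atom_to_list(NodeName)]).
```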

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement for TCP/IP in the networking layer of Erlang/OTP,^10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as-yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

^10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions (execution time in seconds against number of ants, 1–1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000 ants

• Large: 1, 500, 1000, 1500, ..., 100000 ants

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions (execution time in seconds against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set.


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set.

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions (execution time in seconds against number of ants, 1–1000).

Figure 51: Glasgow Xeon machines, large executions (execution time in seconds against number of ants, 1–100000).


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions (execution time in seconds against number of ants, 1–1000).


Figure 54: Heriot-Watt AMD machine, large executions (execution time in seconds against number of ants, 1–100000).

Change Log

Version  Date        Comments

0.1      31.01.2015  First version, submitted to internal reviewers

0.2      23.03.2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27.03.2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

• Executive Summary
• The main case study
  • Sim-Diasca Overview
  • City Example
    • Overview of the simulation case
    • Description of the simulated elements
    • Additional changes done for benchmarking
• Benchmarks
  • Orbit
    • Running Orbit on Athos
    • Distributed Erlang Orbit
    • SD Erlang Orbit
    • Experimental Evaluation
    • Results on Other Architectures
  • Ant Colony Optimisation (ACO)
    • ACO and SMTWTP
    • Multi-colony approaches
    • Evaluating Scalability
    • Experimental Evaluation
  • Performance comparison of different ACO and Erlang versions on the Athos cluster
    • Basic results
    • Increasing the number of messages
    • Some problematic results
    • Network Traffic
    • Summary
  • Measurements
• Distributed Scalability
  • Performance
  • Distributed Performance Analysis
  • Discussion
• BenchErl
• Percept2
  • Experiments
• Deploying Sim-Diasca with WombatOAM
  • The design of the implemented solution
  • Deployment steps
• SD Erlang Integration
  • Implications and Future Work
• Porting Erlang/OTP to the Blue Gene/Q
  • Basing Erlang/OTP's Distribution Mechanism on MPI
  • MPI Driver Internals
  • Current Status of the Blue Gene/Q Port
• Single-machine ACO performance on various architectures and Erlang/OTP releases
  • Experimental parameters
  • Discussion of results
    • EDF Xeon machines
    • Glasgow Xeon machines
    • AMD machines
  • Discussion


Figure 11: D-Orbit and SD-Orbit Performance in Erl-R15B and SDErl-17.4 ((a) runtime; (b) speedup).


Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4 ((a) runtime; (b) speedup).


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) runtime; (b) speedup).


Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) runtime; (b) speedup).


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster ((a) runtime; (b) speedup).


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4: Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started, and the modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
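The process structure described above can be sketched as follows. This is an illustration, not the SMP-ACO source: the toy ant/0 below simply proposes a random schedule and its cost, whereas the real ants construct solutions guided by the ETS-based pheromone matrix.

```erlang
%% Illustrative sketch of the single-colony master/ant structure: the
%% master spawns one process per ant, collects their {Cost, Schedule}
%% results, keeps the cheapest, and starts the next generation.
-module(colony_sketch).
-export([run/2]).

run(NumAnts, Generations) ->
    loop(NumAnts, Generations, {1.0e12, none}).   %% sentinel "worst" solution

loop(_NumAnts, 0, Best) ->
    Best;
loop(NumAnts, G, Best0) ->
    Master = self(),
    [spawn_link(fun() -> Master ! {solution, ant()} end)
     || _ <- lists:seq(1, NumAnts)],
    Sols = [receive {solution, S} -> S end || _ <- lists:seq(1, NumAnts)],
    Best = lists:min([Best0 | Sols]),             %% lowest-cost schedule wins
    %% (here the real master would update the pheromone matrix from Best)
    loop(NumAnts, G - 1, Best).

ant() ->
    Schedule = [rand:uniform(100) || _ <- lists:seq(1, 10)],
    {lists:sum(Schedule), Schedule}.              %% toy cost function
```

In the real application the master would also write the updated pheromone rows back to the ETS table between generations, which the next generation of ants then reads.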

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO (a master process, NC colony nodes, and NA ant processes per colony node).

their best solutions; the globally-best solution is then selected and reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placement of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes. In the next step, each colony process spawns NA ant processes on its local node. In the figure, objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (the level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes and levels. If the number of processes on each node is P, and the number of all available nodes is N, then the number of levels X is the maximum X in the following:


Figure 17: Node Placement in Multi-Level Distributed ACO (the master process at level 0, sub-master nodes at the intermediate levels, and only colony nodes at level N).


Figure 18: Process Placement in Multi-Level ACO.


1 + P + P^2 + P^3 + ... + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 of the 150 nodes can be used.
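As an illustration (this helper is not part of the ML-ACO code), the deepest tree satisfying this inequality, and the number of nodes it actually uses, can be computed as:

```erlang
%% Sketch: find the deepest sub-master tree that fits on N nodes, using the
%% node-count formula 1 + P + ... + P^(X-2) + P^X =< N from the text.
tree_levels(P, N) ->
    tree_levels(P, N, 1).

tree_levels(P, N, X) ->
    case nodes_used(P, X + 1) =< N of
        true  -> tree_levels(P, N, X + 1);
        false -> {X, nodes_used(P, X)}   %% {levels, nodes actually used}
    end.

nodes_used(P, X) ->
    %% sub-master levels 0..X-2, plus P^X colony nodes on the last level
    lists:sum([trunc(math:pow(P, K)) || K <- lists:seq(0, X - 2)])
        + trunc(math:pow(P, X)).
```

For P = 5 and N = 150, tree_levels(5, 150) yields a 3-level tree using 131 nodes, matching the example above.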

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance, using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but uses the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
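The coordination pattern shared by these multi-colony versions can be sketched as follows; the message shapes are illustrative assumptions rather than the actual TL-ACO/ML-ACO protocol.

```erlang
%% Sketch of the (sub-)master's global-iteration loop: ask every colony to
%% run its local iterations, collect each colony's best {Cost, Schedule},
%% pick the global best, and broadcast it back for pheromone updates.
global_iterations(0, Colonies, Best) ->
    [C ! stop || C <- Colonies],
    Best;
global_iterations(I, Colonies, _Prev) ->
    [C ! {run_local_iterations, self()} || C <- Colonies],
    Results = [receive {best, C, Sol} -> Sol end || C <- Colonies],
    Best = lists:min(Results),                 %% {Cost, Schedule}: lowest cost
    [C ! {update_pheromone, Best} || C <- Colonies],
    global_iterations(I - 1, Colonies, Best).
```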

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed; for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution, fail to ever arrive at the global optimum, and never terminate.


Figure 19: Mean error (%) against number of colonies (1–256).

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]^9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken to find a solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

^9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Mean execution time (s) against number of colonies (1–256).

removed non-determinacy by replacing the random number generator with a function which returns a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts, and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
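timer:tc/3 is part of the standard library and returns the wall-clock duration of a call in microseconds, together with its result. A self-timing entry point of the kind described here would be along these lines (the ant_colony module and run/1 function are hypothetical stand-ins for the ACO entry point):

```erlang
%% Time a single run; timer:tc/3 returns {MicroSeconds, Result}.
%% ant_colony:run/1 is a hypothetical stand-in for the ACO main function.
time_run(Args) ->
    {MicroSecs, Result} = timer:tc(ant_colony, run, [Args]),
    io:format("execution time: ~p s~n", [MicroSecs / 1000000]),
    Result.
```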

3.3.1 Basic results

Figures 21-23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24-26 show how the performance of each ACO version varies depending

on the Erlang version, and, as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.
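The connection-reducing mechanism used by SR-ACO is SD Erlang's s_group module: nodes placed in a named s_group maintain transitive connections only within that group. A sketch of group creation, assuming the s_group:new_s_group/2 function of the SD Erlang prototype (the node names here are purely illustrative):

```erlang
%% Sketch: place a submaster and its workers in their own s_group, so
%% that a full mesh of connections is only maintained inside the group.
%% Node names are illustrative.
create_colony_group(GroupName) ->
    Nodes = ['submaster@host1', 'worker1@host2', 'worker2@host3'],
    s_group:new_s_group(GroupName, Nodes).
```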

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.
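The report does not show how the duplication was configured; its effect amounts to a helper of roughly this shape (our own illustration, not the project code):

```erlang
%% Send Msg to Pid Copies times instead of once, multiplying the
%% message traffic without changing the computation the messages drive.
send_n(Pid, Msg, Copies) ->
    lists:foreach(fun(_) -> Pid ! Msg end, lists:seq(1, Copies)).
```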

The results for the various ACO versions are shown in Figures 27-29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21-23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (execution time in seconds against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 25: ML-ACO execution times, Athos cluster (execution time in seconds against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))


Figure 26: GR-ACO execution times, Athos cluster (execution time in seconds against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE))

Figure 27: R15B execution times, messages x 500 (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO)


Figure 28: OTP 17.4 execution times, messages x 500 (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500 (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30-32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time in seconds against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
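Both settings are ordinary command-line arguments to the Erlang VM; for instance, a node restricted to 12 bound schedulers could be started as follows (the scheduler counts here are just for illustration):

```shell
# Bind schedulers using the thread_no_node_processor_spread policy,
# and run 12 schedulers with all 12 of them online.
erl +sbt tnnps +S 12:12
```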

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 logical available cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems when running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca is executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, at which point the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
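In outline, and omitting the plugin boilerplate, the start and stop actions amount to rpc calls of roughly this shape (the file-naming scheme is our own illustration, and we assume Percept2's percept2:profile/2 and percept2:stop_profile/0 entry points):

```erlang
%% Sketch: start and stop Percept2 profiling on every computing node,
%% writing one trace file per node. Options and file names are
%% illustrative, not the actual plugin code.
start_profiling(Nodes) ->
    [rpc:call(Node, percept2, profile,
              ["percept2_" ++ atom_to_list(Node), [all]])
     || Node <- Nodes].

stop_profiling(Nodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- Nodes].
```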

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 1.6GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy the computing nodes first, and the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes, and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a

ICT-287510 (RELEASE) 23rd December 2015 51

simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimise the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way the providers are registered therelease is uploaded the node family is created and the nodes are deployed as detailed in D43 [REL14b]The only difference is that the user should not ask WombatOAM to start the node but he should startthe node by calling the wo orch sim diascastart computing nodes2 function The reason is thatSim-Diasca needs the nodes to have a certain name and that the name of the simulation should bepassed to the script that starts a computing node An example to naming the node is that if thesimulation called soda benchmarking test is executed on host 10001 with the user myuser thenthe name of the node should be Sim-Diasca Soda Benchmarking Test-myuser10001

To call this function the user can first attach to WombatOAMrsquos Erlang node which will provide itwith an Erlang shell

$ relwombatwombatbinwombat attachgt

Within the Erlang shell the following function call starts all nodes in the simdiasca comp nodefamily informing them in the process that they will execute the simulation soda benchmarking test

gt wo_orch_simdiascastart_computing_nodes(gt simdiasca_comp Node family of the computing nodesgt soda_benchmarking_test) Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodesare started with those names

Two new WombatOAM features eliminate the need for the start computing nodes function Sinceperforming the Sim-DiascandashWombatOAM integration work detailed here we have implemented nodename base templates which means that the users can specify the base part of the names of the nodeswhen deploying them (The base part is the node name without the hostname or host address) Thiseliminates the need for renaming the nodes We have also implemented deployment hooks with whichthe node can be started with the desired parameter which means that we no longer need to rely onthis function to supply the start script with the simulation name

ICT-287510 (RELEASE) 23rd December 2015 53

Deploying Sim-Diasca user nodes When the computing nodes are ready the user needs to deploya user node to execute the simulation on them This involves uploading a Sim-Diasca Erlang releaseto WombatOAM defining a node family for the user node and deploying (but not starting) the usernode

Finally as in case of the compute nodes a function is called on the WombatOAM node to start theuser node

gt wo_orch_simdiascastart_user_nodes(simdiasca_user Node family of the user nodesoda_benchmarking_test Simulation name[10001 10002 10003]) Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter andgenerates a config file with the host names The user node calculates the nodes names from the hostnames and the simulation name Further automation would be possible by allowing the caller to passthe node family of the computing nodes instead of the host names and letting WombatOAM calculatethe list of host names By using a deployment hook the functionality of this function could be movedinto a function used as a deployment hook which means that the node could be started the usual way(from WombatOAMrsquos REST API or web dashboard)

Starting the user node will automatically start the execution of the simulation Logs of runningthe simulation will be placed in $HOMEsimdiascatxt and the result of the simulation will be storedin the simdiasca installation directory on the user node The user node stops after executing thesimulation (Its virtual machine instance is not terminated)

52 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units and hence the Time Manager simulationservices are one of the main components that impact the scalability of Sim-Diasca A promising strategyfor reducing the connectivity of large Sim-Diasca simulations is to introduce s groups to partition thetree of time managers We present and critique a possible design for reducing connectivity but have notimplemented or evaluated this approach as we have not yet reached a point where connectivity limitsthe scalability of Sim-Diasca instances (Section 413)

Time Managers schedule the events in a simulation and ensure uniform simulation time acrossall actors participating in the simulation The Time Manager service is based on a single root timemanager per simulation and on exactly as many local ie non-root time managers as there areremaining compute nodes There is no time manager on the user node As depicted in Figure 44the time managers form a tree where a time manager may be the parent of one or more other timemanagers The default height of the tree is one comprising one root and n non-root time managers

Time-managers have localised connections as they communicate only with their parent their chil-dren and any local actors However individual actors have the potential to break this localisation asany actor may communicate with any other actor and hence with any other node We would need toexperiment to discover the extent to which node connections and associated communication patternscan be effectively localised

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes inSim-Diasca The key idea is to create an hierarchy of Time Manager s groups This s group connectiontopology and the associated message routing between s groups is analogous to that of Multi-levelACO This is described and shown to be effective for providing scalable reliability in Section 322

To be more specific a Time Manager belongs to two s groups ie the s group of its parent andsiblings and also an s group containing its children This approach reduces the connections and sharednamespaces between nodes Each Time Manager s group would provide a gateway with processes thatroute messages to other s groups

ICT-287510 (RELEASE) 23rd December 2015 54

Figure 44 The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45 SD Erlang version of Sim-Diasca with an Hierarchy of Time Manager s groups

ICT-287510 (RELEASE) 23rd December 2015 55

6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study the Sim-Diasca Cityinstance and two benchmarks ACO and Orbit The larger and more complex a program is themore involved its benchmarking becomes scalability constraints impact the measurement and profilingtools as well and managing the non-central parts of an application becomes increasingly difficultFor example the loading of the initial data corresponding to the larger scales of the City-example caserequired a few days of computation to complete making the benchmarking unwieldy at best Gatheringruntime metrics is a major issue at these scales a full profile containing Gigabytes of data and requiringa long time to analyse We addressed this problem by making measurements for only a short intervalas in Section 43 for example

By improving the knowledge about these applications and the scalability issues they experienceinterpretations were made preparing the removal of the next bottlenecks to be encountered and pro-moting some design patterns and good practices to enforce regarding scalability Language extensionslike SD Erlang offer a good hope that the next generation of the software that was studied in thisdeliverable will be able to take advantage of upcoming large-scale massive infrastructures similar tothe ones we have been able to benchmark in RELEASE

ICT-287510 (RELEASE) 23rd December 2015 56

A Porting ErlangOTP to the Blue GeneQ

The Blue GeneQ is a massively parallel computer system from the Blue Gene computer architectureseries by IBM It is divided into racks in which nodes are specialized for either computation (these arethe so called compute nodes) or for handling IO (IO nodes) Additional hardware is used for thestorage subsystem and certain specialized nodes These are the service nodes and the front end nodesThe front end nodes are interactive login nodes for system users Only from these nodes a user can getaccess to the rest of the Blue GeneQ system The front end nodes also provide the necessary tools(eg compilers) and allow access to the job control system

Deliverable D22 of RELEASE described the changes to the ErlangOTPrsquos code base that wererequired for porting the system to the front end nodes [SKN+13 Section 53 and Appendix] of theBlue GeneQ It also outlined the challenges for porting the system on the compute nodes of themachine [SKN+13 Section 52]

The port of ErlangOTP to the front end nodes of the machine required replacing numerous POSIXsystem calls with workarounds to stay within the limitations of Blue GeneQrsquos CNK (Compute NodeKernel) operating system Of these changes the workaround with most impact is that every read()and write() system call needs to be executed in a spin loop to avoid blocking as CNK would deadlockunder many circumstances otherwise Similarly pipe() system calls have been replaced with manuallyconnected pairs of sockets which also must be accessed using spin loops

For running Erlang programs on just the front end nodes this port is fully functional in fact evendistributed Erlang can be run on these nodes Erlang nodes can also be started on the compute nodesof the machine Yet for connecting multiple Erlang nodes running on separate compute nodes of theBlue GeneQ system it is necessary to have a working network layer On CNK TCPIP breaks severalassumptions that are valid on POSIX operating systems Firstly due to the absence of fork() it isnot possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlangnode should use Furthermore on CNK multiple compute nodes share the same IP address of theirassociated IO node Thus the port can not be hard-coded either Finally TCPIP communicationsuffers from severe performance and correctness issues on CNK when used for internal communicationOnly one compute node per IO node can simultaneously access the TCPIP stack making deadlocksa severe problem We therefore deemed necessary to use another communication layer one whichdoes not need IO nodes to be involved namely MPI For this to properly be explained first someunderstanding of the networking layer of the Erlang runtime system is required

A1 Basing ErlangOTPrsquos Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes which may be located on different hosts(computer nodes) In the context of scaling distributed Erlang systems using multiple Erlang nodesper host is usually irrelevant as the SMPNUMA capabilities of each host can already be used by theErlang Virtual Machine As outlined before the normal TCPIP based network layer in the Erlangruntime system is not usable on the Blue GeneQ CNK To provide distributed Erlang with a differentnetworking back-end the Erlang runtime uses the concept of drivers which abstract away the actualcommunication and instead provide a table of function calls that enable the required functionality

We developed such a driver based on MPI whose implementation we outline in Section A2 belowLet us however first see how this driver can be used to replace ErlangOTPrsquos distribution mechanismwhich is based on TCPIP and the additional helper module that its use requires

Starting the Erlang Distribution Normally a distributed Erlang node is started using the com-mand erl -sname mynodename (or erl -name mynodename) Such a command is equivalent tostarting the Erlang node with just erl and then transforming it into a distributed node calling from

ICT-287510 (RELEASE) 23rd December 2015 57

Erlang net kernelstart([mynodename shortnames]) (or longnames respectively)To use a different driver for example one called mpi instead one needs to start the Erlang node

with the command erl -no epmd -connect all false -proto dist mpi This will disableepmd only connect to nodes that we explicitly send messages to and activate the mpi dist Erlangmodule as the network driver After starting net kernelstart([mynodename shortnames])will bring up the networking layer as expected

However to run this MPI-based networking back-end on a cluster like Athos one also needs toknow the names of other nodes in the system which is not available at the time the job is submittedIt also proved to be desirable to be able to access properties of the MPI environment before the Erlangnet kernel module is fully initialized To this end we developed an mpihelper module

The MPI Helper Module This module provides the following helper functions startup0startup1 get index0 get world size0 and nodes0 The startup function takes a basename (mpinode by default) and builds a name basename ++ MPI index ++ hostnamewhich is passed to net kernel for initialization It is expected that startup is called from all Erlangnodes with the same base name To fully initialize the distributed Erlang system it passes messagesbetween each pair causing the connection to be properly initialized

Afterwards each node can get the set of all other nodes with mpihelpernodes() Additionally

mpihelperget world size() returns the number of nodes in total and

mpihelperget index() returns the unique MPI index number of the node which is also part ofits node name

A2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi dist It uses the two modules mpi server(for gen server callbacks) and mpi (for interfacing the port program written in C) The MPI driverC code implements the following functions

init This function is called when the driver is initialized It sets up the MPI environment andinitializes required data structures This functionality is also exposed through the mpihelpermodule as mpihelpermpi startup()

start This function is called for every new port opened and only initializes the portrsquos data structures

stop This function is called whenever a port is closed and is currently not supported as MPI con-nections do not need to be closed

output This function is called when the port is in command mode and the Erlang node wants to writesomething to it It works by parsing a command byte from the output and relaying the remainderto the specified functionality The available commands are listen accept connect sendand receive (Additionally there is a possibility to do a passive send) Briefly

bull A listen call initializes a port to listening mode As the first call to listen only happensright after the MPI environment and the net kernel are fully initialized it broadcasts thenames of all Erlang nodes on first call This makes it possible to reverse-map the Erlangnode name to MPI index numbers

bull An accept call makes the port go to acceptor mode where it will wait to be contacted byanother Erlang node There is always at most one port in acceptor mode When accept iscalled a thread is spawned to handle the next incoming connection

ICT-287510 (RELEASE) 23rd December 2015 58

bull Using connect an Erlang node can connect to an acceptor mode remote port A threadis spawned to handle the communication so that the Erlang runtime is not blocked by aconnection being established

bull The send and receive functions respectively send or receive data

ready input Receives some data or tells the runtime to call this function again once it believes thereto be data again

ready output Transmits all data currently buffered in the port

finish Called when the driver is unloaded Currently not supported due to limitations of MPI andits implementations

control Similar to output but called when the port is in control mode The available commandsare statistics command-mode intermediate-mode data-mode tick-messagenumber-for-listen-port and retrieve-creation-number In short

bull The statistics call is used to determine whether there was any communication on theport recently especially to cause tick messages when there was no recent traffic

bull The command-mode intermediate-mode and data-mode calls are used to switchbetween different states in the port

bull The number-for-listen-port and retrieve-creation-number calls always return0

bull Tick messages are sent every time the tick-message call is triggered

A3 Current Status of the Blue GeneQ Port

This MPI driver provides a proper replacement of TCPIP for the networking layer of ErlangOTP10

and as discussed is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene totalk to each other Still the Blue GeneQ port is incomplete What remains is to find a replacementfor dynamically loaded drivers on that platform We are currently exploring alternatives

B Single-machine ACO performance on various architectures andErlangOTP releases

As seen in Sections 314 and 331 our benchmarking results for distributed Erlang applications sug-gest that recent ErlangOTP versions have poorer performance than older versions In particularErlangOTP 174 (released in December 2014) and the RELEASE version of ErlangOTP which amodified version of ErlangOTP 174 both give longer execution times than R15B (released in Decem-ber 2011) Crucially however all three versions have similar scalability curves modulo the runtimepenalty

In an as yet inconclusive attempt to isolate the cause of this phenomenon we have done sometests with the non-distributed version of the ACO application This runs on a single SMP machine andmakes no use of the RELEASE projectrsquos modifications to the Erlang distribution mechanism indeedit makes no use of Erlangrsquos distribution mechanism at all

We have used several different systems

10For example we have been able to run the distributed version of the Orbit benchmark (Section 31) on the Athoscluster using MPI-based distribution

ICT-287510 (RELEASE) 23rd December 2015 59

00

02

04

06

08

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 46 EDF Xeon machines small executions

bull Machines in EDFrsquos ATHOS cluster These each have 24 Intel Xeon E5-2697 v2 processing unitsand 64GB of RAM

bull Machines in the GPG cluster at Glasgow University These each have 16 Intel Xeon E5 2640 v2processing units and 64GB of RAM

bull A machine at Ericsson with an AMD Opteron Processor 4376 HE with 8 cores and 32 GB ofRAM

bull A multicore machine at Heriot-Watt University called cantor which has an AMD Opteron 6248processor (48 cores) and 512GB of RAM

The Xeon machines are hyperthreaded with two processing units per core

B1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants On each systemwe carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant)so that we could observe the effect of the number of processes on execution time There were two sizesof experiment

bull Small 1 10 20 30 1000

bull Large 1 500 1000 1500 100000

The small experiments have a relatively low number of processes similar to those in our distributedexperiments The large experiments use large numbers of concurrent processes to make sure that theErlang VM is fully exercised

The program was run 5 times with each number of ants and our graphs show the mean executiontime over these 5 runs for each number of ants We repeated the experiments with several VM releases

ICT-287510 (RELEASE) 23rd December 2015 60

020

4060

80

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 47 EDF Xeon machines large executions

00

02

04

06

08

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 48 EDF Xeon machines small executions erts +Muacul0 flag set

ICT-287510 (RELEASE) 23rd December 2015 61

020

4060

80

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 49 EDF Xeon machines large executions erts +Muacul0 flag set

R15B (released in December 2011 just after the start of the RELEASE project) R16B03 OTP 170OTP 174 (the most recent official version at the time of writing in early 2015) and a version based onOTP 174 but including modifications from the RELEASE project

B2 Discussion of results

B21 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines It is clear that the 17 versionsuniformly run more slowly than R15B and R163 Closer analysis of the data shows that the ratio ofexecution times is roughly constant with OTP 174 taking about 126 longer than R15B

Having seen these results our colleagues at Ericsson suggested that some new default settings forthe Erlang runtime system in the OTP-17 versions might be affecting the VMrsquos performance and rec-ommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaultsWe tried this and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versionsexcept R15B) Visually these graphs are essentially indistinguishable from Figures 46 and 47 Analysisof the numerical data shows that setting +Muacul0 does in fact improve performance slightly but onlyby about 07 for OTP 174 This clearly doesnrsquot explain the gap between R15B and OTP174

B22 Glasgow Xeon machines

To confirm these results we re-ran these experiments on Xeon machines at Glasgow The results areshown in Figures 50 and 51 and are very similar in form to the results from the EDF machines exceptthat the discrepancy is now about 15 on average

B23 AMD machines

We reported these results to our colleagues at Ericsson but they were unable to reproduce them Theyran our small experiments on an 8-core AMD machine and obtained the results in Figure 52 it is a

ICT-287510 (RELEASE) 23rd December 2015 62

00

02

04

06

08

10

12

14

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 50 Glasgow Xeon machines small executions

020

4060

8010

012

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 51 Glasgow Xeon machines large executions

ICT-287510 (RELEASE) 23rd December 2015 63

little difficult to see the details but the upper line is the result for R15B03 whereas the others are forR16B03 OTP 170 and OTP 174 the latter three being practically indistinguishable

To confirm this we ran our full set of experiments on a 48-core AMD machine at Heriot-WattUniversity The results are shown in Figures 53 and 54 The results here are somewhat more irregularprobably due to the fact that we did not have exclusive access to the machine and other users wererunning occasional small jobs However our results are similar to Ericssonrsquos R16B03 OTP 170 andOTP 174 have similar performance and all are considerably faster than R15B (R15B takes about 9longer than R16B03 and OTP 170 and about 6 longer than the official OTP 174 release) TheRELEASE version performs badly on the AMD machine taking about 8 longer than R15B and 15longer than the official OTP 174 version This contrasts strongly with the results for the EDF Xeonmachines where the official and RELEASE versions of OTP 174 have very similar performance (infact the RELEASE version is about 05 faster than the official version) and both are about 13slower than R15B

B3 Discussion

It thus seems that recent versions of ErlangOTP perform well on AMD architectures but compara-tively badly on Xeon architectures We have as yet been unable to determine the cause of this Onemight suspect that the Xeon machinesrsquo hyperthreading might be responsible but this seems not to bethe case We ran our experiments with the Erlang VM restricted to run only on the even-numberedCPUs (which would mean that only one CPU per core was being used) but still saw the same effect

We have also been unable to explain the bad performance of the RELEASE version in comparisonwith OTP 174 (on which it is based) on the AMD machine The main changes in the RELEASE versionare in the distribution system (to support s groups) and in the addition of DTrace probes to facilitatemonitoring Since our results were obtained using a single-machine version of the ACO program whichmade no use of the distribution system we suspect that the DTrace probes are responsible

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52 Ericsson AMD machine small executions

00

02

04

06

08

10

12

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 53 Heriot-Watt AMD machine small executions

ICT-287510 (RELEASE) 23rd December 2015 65

020

4060

8010

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 54 Heriot-Watt AMD machine large executions

Change Log

Version Date Comments

01 31012015 First Version Submitted to Internal Reviewers

02 23032015 Revised version based on comments from all internal reviewers submittedto the Commission Services

10 27032015 Final version submitted to the Commission Services

References

[APR+12] Stavros Aronis Nikolaos Papaspyrou Katerina Roukounaki Konstantinos Sagonas Yian-nis Tsiouris and Ioannis E Venetis A scalability benchmark suite for ErlangOTP InProceedings of the Eleventh ACM SIGPLAN Workshop on Erlang pages 33ndash42 ACM 2012

[Bas14] Basho Riak 2014

[BBHS99] A Bauer B Bullnheimer RF Hartl and C Strauss An ant colony optimization approachfor the single machine total tardiness problem In Evolutionary Computation 1999 CEC99 Proceedings of the 1999 Congress on volume 2 1999

[CLTG14] N Chechina H Li P Trinder and A Ghaffari Scalable SD Erlang computation modelTechnical Report TR-2014-003 The University of Glasgow December 2014

[dBSD00] Matthijs den Besten Thomas Stutzle and Marco Dorigo Ant colony optimization for thetotal weighted tardiness problem In Marc Schoenauer Kalyanmoy Deb Gunther RudolphXin Yao Evelyne Lutton JuanJulian Merelo and Hans-Paul Schwefel editors ParallelProblem Solving from Nature PPSN VI volume 1917 of Lecture Notes in Computer Sciencepages 611ndash620 Springer Berlin Heidelberg 2000

ICT-287510 (RELEASE) 23rd December 2015 66

[Del13] Pierre Delisle Parallel Ant Colony Optimization Algorithmic Models and Hardware Imple-mentations pages 45ndash62 Intech 2013

[DS04] Marco Dorigo and Thomas Stutzle Ant Colony Optimization Bradford Company ScituateMA USA 2004

[GCTM13] Amir Ghaffari Natalia Chechina Phil Trinder and Jon Meredith Scalable persistentstorage for Erlang Theory and practice In Proceedings of the Twelfth ACM SIGPLANWorkshop on Erlang Erlang rsquo13 pages 73ndash74 New York NY USA 2013 ACM

[GPG15] GPG Cluster 2015 httpwwwdcsglaacukresearchgpgclusterhtm

[Hof10] Todd Hoff Netflix Continually Test by Failing Servers with Chaos Mon-key httphighscalabilitycomblog20101228netflix-continually-test-by-failing-servers-with-chaos-monkehtml December 2010

[IB13] Sorin Ilie and Costin Badica Multi-agent approach to distributed ant colony optimizationScience of Computer Programming 78(6)762ndash774 2013

[KYSO00] H Kawamura M Yamamoto K Suzuki and A Ohuchi Multiple ant colonies algorithmbased on colony level interactions IEICE Transactions on Fundamentals of ElectronicsCommunications and Computer Sciences E83-A(2)371ndash379 2000

[LN01] Frank Lubeck and Max Neunhoffer Enumerating large Orbits and direct condensationExperimental Mathematics 10(2)197ndash205 2001

[Lun12] Daniel Luna Chaos Monkey httpsgithubcomdLunachaos_monkey 2012

[MC98] Jeff Matocha and Tracy Camp A taxonomy of distributed termination detection algorithmsJournal of Systems and Software 43(221)207ndash221 1998

[McN59] Robert McNaughton Scheduling with deadlines and loss functions Management Science6(1)1ndash12 1959

[MM00] Daniel Merkle and Martin Middendorf An ant algorithm with a new pheromone evaluationrule for total tardiness problems In Proceedings of EvoWorkshops 2000 volume 1803 ofLNCS pages 287ndash296 Springer Verlag 2000

[MRS02] Martin Middendorf Frank Reischle and Hartmut Schmeck Multi colony ant algorithmsJournal of Heuristics 8(3)305ndash320 2002

[PNC11] Martin Pedemonte Sergio Nesmachnow and Hector Cancela A survey on parallel ant colonyoptimization Appl Soft Comput 11(8)5181ndash5197 2011

[PVW91] C N Potts and L N Van Wassenhove Single machine tardiness sequencing heuristics IIETransactions 23(4)346ndash354 1991

[REL14a] RELEASE Project Deliverable D34 Scalable Reliable OTP Library Release September2014

[REL14b] RELEASE Project Deliverable D43 Heterogeneous Super-cluster Infrastructure July 2014

[REL15] RELEASE Project Deliverable D35 Performance Portability Principles February 2015

ICT-287510 (RELEASE) 23rd December 2015 67

[RV09] NR Srinivasa Raghavan and M Venkataramana Parallel processor scheduling for minimiz-ing total weighted tardiness using ant colony optimization The International Journal ofAdvanced Manufacturing Technology 41(9-10)986ndash996 2009

[SKN+13] Konstantinos Sagonas David Klaftenegger Patrik Nyblom Nikolaos Papaspyrou KaterinaRoukounaki and Kjell Winblad Deliverable D22 Prototype Scalable Erlang VM ReleaseApril 2013

[Sol14] Erlang Solutions WombatOAM-enabled Sim-Diasca httpsgithubcomrelease-projectsimdiasca 2014

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
              • Ant Colony Optimisation (ACO)
                • ACO and SMTWTP
                • Multi-colony approaches
                • Evaluating Scalability
                • Experimental Evaluation
              • Performance comparison of different ACO and Erlang versions on the Athos cluster
                • Basic results
                • Increasing the number of messages
                • Some problematic results
                • Network Traffic
              • Summary
              • Measurements
                • Distributed Scalability
                  • Performance
                  • Distributed Performance Analysis
                  • Discussion
                • BenchErl
                • Percept2
              • Experiments
                • Deploying Sim-Diasca with WombatOAM
                  • The design of the implemented solution
                  • Deployment steps
                • SD Erlang Integration
                  • Implications and Future Work
                • Porting Erlang/OTP to the Blue Gene/Q
                  • Basing Erlang/OTP's Distribution Mechanism on MPI
                  • MPI Driver Internals
                  • Current Status of the Blue Gene/Q Port
              • Single-machine ACO performance on various architectures and Erlang/OTP releases
                • Experimental parameters
                • Discussion of results
                  • EDF Xeon machines
                  • Glasgow Xeon machines
                  • AMD machines
                • Discussion

ICT-287510 (RELEASE) 23rd December 2015 20

Figure 12: D-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime, (b) Speedup)


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime, (b) Speedup)


Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime, (b) Speedup)


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster ((a) Runtime, (b) Speedup)


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to Deliverable D3.4 (Scalable Reliable OTP Library Release) [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.
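Concretely, the standard SMTWTP cost function charges each job a penalty proportional to how late it finishes: if job j has weight w_j, due date d_j, and completion time C_j, the cost of a schedule is the sum of w_j · max(0, C_j − d_j). A minimal Python sketch (the three-job instance is hypothetical):

```python
def total_weighted_tardiness(schedule):
    """Cost of a linear schedule: sum of w_j * max(0, C_j - d_j), where
    C_j is job j's completion time and d_j its due date."""
    t = 0
    cost = 0
    for length, due, weight in schedule:
        t += length  # this job completes at time t
        cost += weight * max(0, t - due)
    return cost

# Hypothetical 3-job instance: (processing time, due date, weight)
jobs = [(2, 3, 1), (4, 5, 2), (3, 6, 1)]
print(total_weighted_tardiness(jobs))  # -> 5  (jobs finish at t = 2, 6, 9)
```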

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met; for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


Figure 16: Two-Level Distributed ACO (master process; NC colony nodes, each running NA ant processes)

their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes. In the next step, each colony process spawns NA ant processes on the local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional. There are IM communications between the master process and a colony process; also, IA bidirectional communications take place between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L + 1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X in the following:


Figure 17: Node Placement in Multi-Level Distributed ACO (levels 0 to N−1 hold the master and sub-master nodes; level N contains only colony nodes)


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + … + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 = 131 ≤ 150), and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO, nodes are only connected to the nodes in their own s group.
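The node-count relation for the ML-ACO sub-master tree can be checked mechanically. A small Python sketch of the formula above (helper names hypothetical; valid for trees with at least two levels):

```python
def usable_nodes(p, x):
    """Nodes used by an x-level sub-master tree with p processes per node:
    1 + p + ... + p^(x-2) sub-master nodes, plus p^x colony nodes."""
    return sum(p**i for i in range(x - 1)) + p**x

def max_levels(p, n):
    """Largest number of levels x such that usable_nodes(p, x) <= n."""
    x = 1
    while usable_nodes(p, x + 1) <= n:
        x += 1
    return x

# The worked example from the text: P = 5, N = 150
print(max_levels(5, 150), usable_nodes(5, 3))  # -> 3 131
```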

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.
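The restart behaviour that GR-ACO and SR-ACO add can be illustrated with a toy analogue in Python: instead of Erlang supervision, a loop simply retries a crashed "colony" rather than waiting forever for its result. All names here are hypothetical, and where real Chaos Monkey kills are random, this sketch is deterministic:

```python
def run_supervised(make_worker, n_workers, max_restarts=10):
    """Toy analogue of GR/SR-ACO supervision: run each 'colony' task and
    restart any that crashes, instead of blocking forever on its result.
    make_worker(i) returns a zero-argument callable (hypothetical helper)."""
    results = {}
    restarts = 0
    for i in range(n_workers):
        while i not in results:
            try:
                results[i] = make_worker(i)()
            except RuntimeError:  # stands in for a Chaos Monkey exit signal
                restarts += 1
                if restarts > max_restarts:
                    raise
    return results, restarts
```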

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature; see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed; for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


Figure 19: Mean Error (%) against Number of Colonies (1 to 256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run your program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions. It is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
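One plausible reading of the "mean error" metric in Figure 19 is the mean percentage gap between the obtained cost and the known optimum. The deliverable does not spell out the formula, so the following Python sketch is an assumption (the instance costs are hypothetical):

```python
def mean_error(costs, optima):
    """Mean percentage excess of obtained costs over the known optimal
    costs, across a set of benchmark instances (assumes optima > 0)."""
    assert len(costs) == len(optima)
    errs = [100.0 * (c - o) / o for c, o in zip(costs, optima)]
    return sum(errs) / len(errs)

# Hypothetical costs for three instances with known optima
print(mean_error([110, 220, 99], [100, 200, 90]))  # -> 10.0
```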

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Mean Execution Time (s) against Number of Colonies (1 to 256)

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
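The deterministic stand-in for the random number generator can be sketched as follows. The actual cyclic sequence used in the experiments is not given in the text, so the values below are hypothetical:

```python
import itertools

class CyclicRandom:
    """Drop-in stand-in for a uniform RNG that cycles through a fixed
    sequence, making repeated runs deterministic (sequence hypothetical)."""
    def __init__(self, seq=(0.11, 0.47, 0.83, 0.29, 0.65)):
        self._it = itertools.cycle(seq)

    def random(self):
        return next(self._it)

rng = CyclicRandom()
print([rng.random() for _ in range(7)])
# -> [0.11, 0.47, 0.83, 0.29, 0.65, 0.11, 0.47]
```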

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and that the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s groups to


Figure 21: R15B execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (R15B, OTP 17.4 official, OTP 17.4 RELEASE)

Figure 25: ML-ACO execution times, Athos cluster (R15B, OTP 17.4 official, OTP 17.4 RELEASE)


Figure 26: GR-ACO execution times, Athos cluster (R15B, OTP 17.4 official, OTP 17.4 RELEASE)

Figure 27: R15B execution times, messages × 500 (TL-ACO, ML-ACO, GR-ACO)


Figure 28: OTP 17.4 execution times, messages × 500 (TL-ACO, ML-ACO, GR-ACO)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages × 500 (TL-ACO, ML-ACO, GR-ACO, SR-ACO)


Figure 30: R15B execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
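The compressed SLURM hostlists above can be expanded mechanically, and the number of comma-separated tokens gives a crude measure of how fragmented an allocation is. A simplified Python sketch (single bracket group, zero-padded numeric ranges only; the example allocation is hypothetical):

```python
import re

def expand_hostlist(expr):
    """Expand a SLURM-style compressed hostlist such as 'atcn[055-057,109]'
    into individual host names. Simplified sketch: one bracket group,
    zero-padded numeric ranges only."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", expr)
    prefix, body = m.group(1), m.group(2)
    hosts = []
    for token in body.split(","):
        if "-" in token:
            lo, hi = token.split("-")
            width = len(lo)  # preserve zero padding, e.g. '055'
            hosts.extend(f"{prefix}{i:0{width}d}"
                         for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + token)
    return hosts

# Hypothetical allocation: three contiguous blocks, six hosts in total
print(expand_hostlist("atcn[055-057,109,235-236]"))
# -> ['atcn055', 'atcn056', 'atcn057', 'atcn109', 'atcn235', 'atcn236']
```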

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.
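The study used netstat for these measurements; an alternative lightweight way to sample per-interface packet counters on Linux is to parse /proc/net/dev, whose second and tenth numeric fields after the interface name are the RX and TX packet counts. A Python sketch (not the project's actual tooling):

```python
def packet_counts(proc_net_dev_text):
    """Parse /proc/net/dev-style text into {iface: (rx_packets, tx_packets)}.
    In the standard layout, field 2 after the colon is RX packets and
    field 10 is TX packets."""
    counts = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        iface, rest = line.split(":", 1)
        fields = rest.split()
        counts[iface.strip()] = (int(fields[1]), int(fields[9]))
    return counts

sample = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast|"
    "bytes packets errs drop fifo colls carrier compressed\n"
    "  eth0: 1000000 2000 0 0 0 0 0 0 500000 1500 0 0 0 0 0 0\n"
)
print(packet_counts(sample))  # -> {'eth0': (2000, 1500)}
```

Sampling this file before and after a run, and differencing the counters, gives per-run packet totals of the kind plotted in Figure 33.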

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO ((a) Number of Sent Packets, (b) Number of Received Packets)


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption, we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed at that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4 × 1000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
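The speedup and efficiency figures quoted here follow directly from the runtimes: relative speedup is the single-node time divided by the n-node time, and parallel efficiency is that speedup divided by the node count. A Python sketch, with hypothetical runtimes chosen to echo Figure 35 (about 1000 minutes on 1 node and just under 300 on 16):

```python
def speedup(t_base, t_n):
    """Relative speedup of an n-node run against the single-node baseline."""
    return t_base / t_n

def efficiency(t_base, t_n, n_nodes):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup(t_base, t_n) / n_nodes

# Hypothetical runtimes: ~1000 min on 1 node, ~290 min on 16 nodes
print(round(speedup(1000, 290), 2))         # -> 3.45
print(round(efficiency(1000, 290, 16), 2))  # -> 0.22
```

An efficiency around 22% on 16 nodes is what motivates the profiling work in the following sections.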

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen of them has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers more hosts different OTPreleases or different command-line arguments

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code; and

• most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
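The shape of such a plugin can be sketched as follows. This is an illustrative sketch only, not Sim-Diasca's actual plugin API: the callback names (`on_simulation_start/1`, `on_simulation_stop/1`) and the Percept2 option list are assumptions.

```erlang
%% Illustrative sketch of the Percept2 plugin (callback names and the
%% Percept2 profiling options are assumptions, not the real Sim-Diasca
%% plugin interface).
-module(percept2_plugin_sketch).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Called by the engine when the simulation begins: start Percept2
%% profiling on every computing node, one trace file per node.
on_simulation_start(ComputingNodes) ->
    [rpc:call(Node, percept2, profile, [trace_file_for(Node), [all]])
     || Node <- ComputingNodes],
    ok.

%% Called when the simulation ends: stop profiling everywhere, leaving
%% one per-node file for later offline analysis and visualisation.
on_simulation_stop(ComputingNodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.

trace_file_for(Node) ->
    "percept2_" ++ atom_to_list(Node) ++ ".dat".
```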

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we expected that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, that is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
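For illustration, such configuration entries might be supplied along the following lines. This is a sketch only: the concrete way Sim-Diasca consumes these options (proplist, record field, or configuration file), the value conventions, and the cookie name shown here are assumptions.

```erlang
%% Illustrative sketch: the real option-passing mechanism and value
%% conventions are Sim-Diasca-specific; the cookie atom is made up.
WombatDeploymentSettings = [
    %% The computing nodes received as a parameter are already running
    %% (they were started by WombatOAM), so no node deployment is needed.
    {start_nodes, false},

    %% Use the fixed cookie shared by all computing nodes instead of a
    %% randomly generated one.
    {use_cookies, 'wombat-simdiasca-cookie'}
].
```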

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming scheme: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, this functionality could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
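The grouping just described could be expressed with the SD Erlang s_group API roughly as follows. This is a hedged sketch of the unimplemented design: the group names, node lists and helper function are illustrative, and the one-node child groups shown would in practice also contain the children of each local manager.

```erlang
%% Illustrative sketch: one s_group for the root time manager and its
%% direct children (Figure 45), plus one s_group per local manager for
%% its own subtree. s_group:new_s_group/2 creates a named s_group over
%% a set of nodes; return values are ignored here for brevity.
create_time_manager_groups(RootNode, LocalManagerNodes) ->
    %% Root manager and its children share one s_group.
    _ = s_group:new_s_group(root_tm_group, [RootNode | LocalManagerNodes]),
    %% Each local manager also heads an s_group for its children
    %% (placeholder: only the manager node itself is listed here).
    [_ = s_group:new_s_group(child_group_name(N), [N])
     || N <- LocalManagerNodes],
    ok.

child_group_name(Node) ->
    list_to_atom("tm_group_" ++ atom_to_list(Node)).
```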


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, instead one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. Afterwards, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI index ++ "@" ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
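Putting the helper functions together, the first Erlang steps of a job running on every MPI rank might look like the following sketch (error handling omitted; the return value of startup/0 is an assumption, and the module is only usable inside an MPI-launched job):

```erlang
%% Sketch: bringing up MPI-based distribution on every rank of a job.
%% Assumes the mpihelper module described above is on the code path.
-module(mpi_job_sketch).
-export([main/0]).

main() ->
    %% Initialize distribution with the default base name ("mpinode");
    %% this also exchanges messages pairwise to open all connections.
    ok = mpihelper:startup(),
    N  = mpihelper:get_world_size(),   % total number of Erlang nodes
    I  = mpihelper:get_index(),        % this node's unique MPI index
    io:format("node ~p of ~p: peers = ~p~n", [I, N, mpihelper:nodes()]).
```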

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program, written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.

ICT-287510 (RELEASE) 23rd December 2015 59

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
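For reference, the two ant-count sequences can be generated programmatically. This is a sketch only: it assumes the elided steps continue uniformly (10 for the small runs, 500 for the large ones), as the listed prefixes suggest, and the module and function names are ours.

```erlang
%% Sketch: ant counts for the two experiment sizes, assuming uniform steps.
-module(ant_counts).
-export([small/0, large/0]).

small() -> [1 | lists:seq(10, 1000, 10)].      %% 1, 10, 20, 30, ..., 1000
large() -> [1 | lists:seq(500, 100000, 500)].  %% 1, 500, 1000, ..., 100000
```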

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:

Figure 47: EDF Xeon machines, large executions

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a

Figure 50: Glasgow Xeon machines, small executions

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.

Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1 31/01/2015 First version. Submitted to internal reviewers.

0.2 23/03/2015 Revised version based on comments from all internal reviewers; submitted to the Commission Services.

1.0 27/03/2015 Final version submitted to the Commission Services.

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 13: SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime, (b) Speedup)

Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4 ((a) Runtime, (b) Speedup)

Figure 15: D-Orbit and SD-Orbit Performance on Kalkyl Cluster ((a) Runtime, (b) Speedup)


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4: Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started, the modifications to P serving to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
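A minimal sketch of one generation under this design might look as follows. The process structure matches the description above (one process per ant, a master collecting and comparing results), but all names, and the placeholder solution construction, are ours rather than taken from the actual SMP-ACO code; the pheromone logic is elided.

```erlang
%% Sketch of one generation: one process per ant, a master that collects
%% results and keeps the best (lowest-cost) solution. In the real design
%% an ant would read rows of the ETS pheromone table, and the master
%% alone would write the update back afterwards.
-module(colony_sketch).
-export([generation/1]).

generation(NumAnts) ->
    Master = self(),
    [spawn(fun() -> Master ! {ant_done, construct_solution()} end)
     || _ <- lists:seq(1, NumAnts)],
    collect(NumAnts, undefined).

collect(0, Best) -> Best;
collect(N, Best) ->
    receive {ant_done, Sol} -> collect(N - 1, best(Sol, Best)) end.

best(Sol, undefined) -> Sol;
best({C1, _} = A, {C2, _} when C1 =< C2 -> A;
best(_, B) -> B.

construct_solution() ->
    %% Placeholder for the heuristic schedule construction.
    {rand:uniform(1000), schedule}.
```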

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report

Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes. In the next step, each colony process spawns NA ant processes on the local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional. There are IM communications between the master process and a colony process; also, IA bidirectional communications are done between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists of only colony nodes, and every colony node has one colony process. A process on level N-1 (one level prior to the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X in the following:

Figure 17: Node Placement in Multi-Level Distributed ACO

Figure 18: Process Placement in Multi-Level ACO

1 + P + P^2 + P^3 + ... + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 = 131 ≤ 150), and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s group.
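The node/level relation used by ML-ACO can be checked mechanically. The following is a small sketch (the module and helper names are ours, not part of the ACO code) that finds the maximum number of levels for given P and N:

```erlang
%% Sketch: largest X with 1 + P + ... + P^(X-2) + P^X =< N, together with
%% the number of nodes such a tree actually uses.
-module(aco_tree).
-export([levels/2, nodes_used/2]).

nodes_used(P, X) when X >= 2 ->
    %% sub-master nodes (1 + P + ... + P^(X-2)) plus P^X colony nodes
    SubMasters = lists:sum([pow(P, L) || L <- lists:seq(0, X - 2)]),
    SubMasters + pow(P, X).

levels(P, N) -> levels(P, N, 2).
levels(P, N, X) ->
    case nodes_used(P, X + 1) =< N of
        true  -> levels(P, N, X + 1);
        false -> X
    end.

pow(_, 0) -> 1;
pow(B, E) when E > 0 -> B * pow(B, E - 1).
```

For P = 5 and N = 150 this yields 3 levels using 131 of the 150 nodes, matching the example above.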

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

Figure 19: Mean Error (mean error in % against number of colonies, 1 to 256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, and then to run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions; it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

Figure 20: Execution time (mean execution time in seconds against number of colonies, 1 to 256)

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts, and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
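The determinised "random" source mentioned above can be sketched as a small server process holding a fixed cycle of values. All names here are ours, and the project's actual replacement may differ; the point is simply that repeated runs draw identical numbers.

```erlang
%% Sketch: a process that serves a fixed sequence of values cyclically,
%% so every run of the benchmark sees the same "random" choices.
-module(cyclic_seq).
-export([start/1, next/1]).

start(Values) when Values =/= [] ->
    spawn(fun() -> loop(Values, Values) end).

loop(All, []) -> loop(All, All);                      %% wrap around
loop(All, [V | Rest]) ->
    receive {next, From} -> From ! {value, V}, loop(All, Rest) end.

next(Pid) ->
    Pid ! {next, self()},
    receive {value, V} -> V end.
```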

We ran each experiment with Erlang versions R15B, OTP 17.4 and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23, we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s groups to

Figure 21: R15B execution times, Athos cluster

Figure 22: OTP 17.4 execution times, Athos cluster

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version; as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (execution time in seconds vs. number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)).

Figure 25: ML-ACO execution times, Athos cluster (execution time in seconds vs. number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)).


Figure 26: GR-ACO execution times, Athos cluster (execution time in seconds vs. number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)).

Figure 27: R15B execution times, messages x 500 (execution time in seconds vs. number of nodes; TL-ACO, ML-ACO, GR-ACO).


Figure 28: OTP 17.4 execution times, messages x 500 (execution time in seconds vs. number of nodes; TL-ACO, ML-ACO, GR-ACO).

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500 (execution time in seconds vs. number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO).


Figure 30: R15B execution times (2), Athos cluster (execution time in seconds vs. number of nodes; TL-ACO, ML-ACO, GR-ACO).

which illustrate a phenomenon that has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, so that at certain points adding a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time in seconds vs. number of nodes; TL-ACO, ML-ACO, GR-ACO).

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time in seconds vs. number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO).


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
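Fragmentation of an allocation like the ones quoted above can be quantified by counting contiguous runs of node numbers. A small sketch (Python; the helper names are ours, and the shortened node lists are excerpts of the allocations shown earlier):

```python
import re

def expand_nodelist(spec: str) -> list[int]:
    """Expand a SLURM-style list such as 'atcn[055-072,109,141-144]'
    into the individual node numbers."""
    body = re.search(r"\[(.*)\]", spec).group(1)
    nodes = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(range(int(lo), int(hi) + 1))
        else:
            nodes.append(int(part))
    return nodes

def fragments(nodes: list[int]) -> int:
    """Number of contiguous runs: a rough fragmentation measure."""
    return sum(1 for i, n in enumerate(nodes)
               if i == 0 or n != nodes[i - 1] + 1)

busy = expand_nodelist("atcn[141,144,181-184,189-198]")    # excerpt, busy cluster
quiet = expand_nodelist("atcn[055-072,109-144]")           # excerpt, quiet cluster
print(fragments(busy), fragments(quiet))
```

More runs for the same node count means nodes scattered across more regions of the cluster, and hence more chances of crossing a "distant" communication boundary.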

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new 'small' scale of the City-example case, i.e. the second version of the 'small' scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurements to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4 x 1000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node; it shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores) and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
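These speedup figures follow directly from the runtimes. The sketch below (Python) recomputes them, using runtime values that are approximate readings from Figure 34 rather than exact measurements:

```python
# Relative speedup and parallel efficiency versus a single 16-core node.
# Runtimes (minutes) are approximate values read off Figure 34 (assumed).

runtimes = {1: 1000, 2: 667, 4: 455, 16: 290}  # nodes -> minutes

def speedup(nodes: int) -> float:
    return runtimes[1] / runtimes[nodes]

def efficiency(nodes: int) -> float:
    """Speedup divided by the ideal (linear) speedup."""
    return speedup(nodes) / nodes

for n in sorted(runtimes):
    print(f"{n:2d} nodes: speedup {speedup(n):.2f}, efficiency {efficiency(n):.2f}")
```

The efficiency column makes the degradation explicit: by 16 nodes the hardware is only about a fifth utilised relative to linear scaling.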

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 logical available cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The number of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

What we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips and the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
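The delayed-window scheme used by the plugin generalises to any profiler. A sketch of the timing logic (Python for illustration; the actual plugin is written in Erlang against Sim-Diasca's plugin mechanism, and the function names here are ours):

```python
import threading
import time

def profile_window(start_profiler, stop_profiler, delay: float, window: float):
    """Call `start_profiler` `delay` seconds after the simulation begins
    and `stop_profiler` `window` seconds later, from a background thread,
    so that only a short slice of the run is traced."""
    def driver():
        time.sleep(delay)
        start_profiler()
        time.sleep(window)
        stop_profiler()
    t = threading.Thread(target=driver, daemon=True)
    t.start()
    return t

# Stand-ins for the profiler controls; the report used delay=10, window=5.
events = []
profile_window(lambda: events.append("start"),
               lambda: events.append("stop"),
               delay=0.01, window=0.01).join()
print(events)  # ['start', 'stop']
```

Trading trace coverage for file size this way keeps the output analysable at the cost of seeing only a sample of the run.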

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described, and shown to be effective for providing scalable reliability, in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
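A minimal sketch of how such groups might be created, assuming the SD Erlang s_group:new_s_group/2 API. This design has not been implemented (see above); the module and group names are hypothetical:

```erlang
%% Design sketch only (not implemented): create one s_group per parent
%% time manager, containing the parent's node and its children's nodes,
%% so that each node belongs to at most two s_groups.
-module(tm_sgroups).
-export([group_name/1, create/2]).

%% Derive a hypothetical s_group name from the parent's node name.
group_name(ParentNode) ->
    list_to_atom("tm_group_" ++ atom_to_list(ParentNode)).

%% Assumes SD Erlang's s_group:new_s_group/2 is available on this VM.
create(ParentNode, ChildNodes) ->
    Name = group_name(ParentNode),
    {ok, Name, _Nodes} = s_group:new_s_group(Name, [ParentNode | ChildNodes]),
    Name.
```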


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we were able to interpret the results, prepare for the removal of the next bottlenecks to be encountered, and promote some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user access the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which must also be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, one instead needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes to which we explicitly send messages, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpi_node by default) and builds a name BaseName ++ MpiIndex ++ Hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.
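A sketch of how the name might be assembled (assumed layout; the real mpi_helper module may use different separators):

```erlang
%% Hypothetical sketch of the node-name construction performed by
%% mpi_helper:startup/1: base name ++ MPI rank ++ host name.
-module(mpi_helper_sketch).
-export([node_name/2]).

node_name(BaseName, MpiIndex) ->
    {ok, Host} = inet:gethostname(),
    list_to_atom(BaseName ++ integer_to_list(MpiIndex) ++ "@" ++ Host).
```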

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally:

mpi_helper:get_world_size() returns the total number of nodes, and

mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it waits to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data available.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to cause tick messages when there has been no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. (Execution time in seconds against number of ants, 1–1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40, and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.
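For precision, the two ant-count sequences can be written down as follows (a hypothetical helper, not part of the benchmark driver itself):

```erlang
%% The ant counts used in the two experiment sizes: 1, followed by an
%% arithmetic progression (step 10 up to 1000, or step 500 up to 100000).
-module(ant_counts).
-export([small/0, large/0]).

small() -> [1 | lists:seq(10, 1000, 10)].
large() -> [1 | lists:seq(500, 100000, 500)].
```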

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. (Execution time in seconds against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)

Figure 48: EDF Xeon machines, small executions, with the erts +Muacul0 flag set. (Execution time in seconds against number of ants, 1–1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)


Figure 49: EDF Xeon machines, large executions, with the erts +Muacul0 flag set. (Execution time in seconds against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the OTP 17 versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. (Execution time in seconds against number of ants, 1–1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)

Figure 51: Glasgow Xeon machines, large executions. (Execution time in seconds against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which means that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. (Execution time in seconds against number of ants, 1–1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)


Figure 54: Heriot-Watt AMD machine, large executions. (Execution time in seconds against number of ants, 1–100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).)

Change Log

Version  Date        Comments

0.1      31.01.2015  First version, submitted to internal reviewers

0.2      23.03.2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27.03.2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 14: D-Orbit and SD-Orbit Performance in SD Erlang/OTP 17.4. ((a) Runtime; (b) Speedup.)


Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster. ((a) Runtime; (b) Speedup.)


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4: Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO discussed in this section is open source, and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started. The modifications to P serve to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
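The master/ant structure described above can be sketched as follows. This is a minimal illustration, not the RELEASE implementation: the module and function names (smp_aco, run_generation) are invented, the cost function is a stand-in for the real SMTWTP cost, the pheromone-guided solution construction is replaced by random choice, and the rand module assumes a more recent OTP release than the 17.4 used in the experiments.

```erlang
%% Minimal sketch of one generation of a single-colony ACO in Erlang.
%% Each ant is a process; a master process collects and compares results.
-module(smp_aco).
-export([run_generation/2]).

%% An "ant": constructs a (here: random) candidate schedule of N positions
%% and reports it, with its cost, to the master. The real implementation is
%% guided by the pheromone matrix held in an ETS table.
ant(Master, N) ->
    Solution = [rand:uniform(N) || _ <- lists:seq(1, N)],
    Master ! {solution, cost(Solution), Solution}.

cost(Solution) -> lists:sum(Solution).  % stand-in for the SMTWTP cost function

%% The master spawns NumAnts ant processes, collects every solution and
%% keeps the cheapest one, which would then drive the pheromone update.
run_generation(NumAnts, N) ->
    Master = self(),
    _Pids = [spawn(fun() -> ant(Master, N) end) || _ <- lists:seq(1, NumAnts)],
    collect(NumAnts, none).

collect(0, Best) -> Best;
collect(K, Best) ->
    receive
        {solution, Cost, Sol} -> collect(K - 1, best({Cost, Sol}, Best))
    end.

best(New, none) -> New;
best({C1, _} = New, {C2, _}) when C1 < C2 -> New;
best(_, Best) -> Best.
```

In the full application the returned best solution updates the ETS-based pheromone matrix before the next generation is spawned.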

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

Figure 16: Two-Level Distributed ACO

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of the TL-ACO in a cluster with N_C nodes. The master process spawns N_C colony processes on the available nodes. In the next step, each colony process spawns N_A ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are I_M communications between the master process and a colony process, and I_A bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO, the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of sub-masters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L + 1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level above the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying 1 + P + P^2 + ... + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150 then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 nodes out of the 150 can be used.

Figure 17: Node Placement in Multi-Level Distributed ACO

Figure 18: Process Placement in Multi-Level ACO

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
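The level calculation in the ML-ACO description can be sketched as a pair of hypothetical helper functions (module and function names are invented): nodes_used/2 computes the node budget 1 + P + ... + P^(X-2) + P^X for a tree of X levels, and levels/2 finds the largest X that fits within N available nodes.

```erlang
%% Sketch of the ML-ACO node-budget calculation described above.
-module(ml_tree).
-export([levels/2, nodes_used/2]).

%% Total nodes needed for X levels with P processes per node: one master
%% node, P..P^(X-2) sub-master nodes, and P^X colony nodes (there is no
%% P^(X-1) term because the last sub-master level shares nodes, while each
%% colony node hosts exactly one colony process).
nodes_used(P, X) ->
    lists:sum([pow(P, I) || I <- lists:seq(0, X - 2)]) + pow(P, X).

%% Largest X whose node budget still fits within N available nodes.
levels(P, N) -> levels(P, N, 1).

levels(P, N, X) ->
    case nodes_used(P, X + 1) > N of
        true  -> X;
        false -> levels(P, N, X + 1)
    end.

pow(_, 0) -> 1;
pow(P, K) -> P * pow(P, K - 1).
```

With P = 5 and N = 150 this reproduces the worked example: 3 levels, using 1 + 5 + 125 = 131 of the 150 nodes.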

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.
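The supervision idea exercised by Chaos Monkey can be illustrated with a toy sketch (all names are invented, and the failure is simulated deterministically; the real GR-ACO and SR-ACO use the global registry and s_groups respectively): the master monitors each colony with spawn_monitor/1 and restarts any colony that dies before reporting, instead of waiting indefinitely.

```erlang
%% Toy illustration of detect-and-restart supervision for colony processes.
-module(gr_sketch).
-export([run/1]).

%% A colony that crashes on its first attempt if Id is even, and otherwise
%% reports a dummy best solution back to the master.
start_colony(Master, Id, Attempt) ->
    {_Pid, Ref} = spawn_monitor(fun() ->
        case {Id rem 2, Attempt} of
            {0, 1} -> exit(simulated_failure);
            _      -> Master ! {result, Id, Id * 10}
        end
    end),
    {Ref, {Id, Attempt}}.

run(NumColonies) ->
    Master = self(),
    Running = maps:from_list(
        [start_colony(Master, Id, 1) || Id <- lists:seq(1, NumColonies)]),
    loop(Running, [], NumColonies).

%% Collect one result per colony; on a monitored failure, restart the
%% faulty colony rather than blocking the whole iteration.
loop(_Running, Results, 0) ->
    lists:sort(Results);
loop(Running, Results, Left) ->
    receive
        {result, Id, R} ->
            loop(Running, [{Id, R} | Results], Left - 1);
        {'DOWN', Ref, process, _Pid, simulated_failure} ->
            {Id, Attempt} = maps:get(Ref, Running),
            {NewRef, Info} = start_colony(self(), Id, Attempt + 1),
            loop(Running#{NewRef => Info}, Results, Left)
    end.
```

Even though half the colonies crash on their first attempt, the run still completes with a result from every colony.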

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.

Figure 19: Mean error (%) against number of colonies (1–256)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run your program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions. It is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
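For concreteness, the mean error plotted in Figure 19 can be computed as the average percentage gap between the obtained costs and the known optimal costs over the benchmark instances. The following hypothetical helper (module and function names are invented) shows the calculation:

```erlang
%% Hypothetical helper for the quality metric of Figure 19: the mean
%% percentage difference between obtained and optimal costs.
-module(aco_eval).
-export([mean_error/1]).

%% Takes a non-empty list of {ObtainedCost, OptimalCost} pairs and returns
%% the mean percentage error.
mean_error(Pairs) ->
    Errors = [100 * (Obtained - Optimal) / Optimal
              || {Obtained, Optimal} <- Pairs],
    lists:sum(Errors) / length(Errors).
```

For example, one run 10% above optimum and one exactly at optimum give a mean error of 5%.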

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

Figure 20: Mean execution time (s) against number of colonies (1–256)

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
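The determinism trick described above can be sketched as a process serving a fixed cyclic sequence in place of a random number generator (names are invented; the actual replacement used in the experiments may differ):

```erlang
%% Sketch of a deterministic stand-in for a random number generator:
%% a server process that hands out a fixed, non-empty sequence cyclically.
-module(cyclic_rand).
-export([start/1, next/1]).

%% Start a server for the given (non-empty) sequence.
start(Seq) ->
    spawn(fun() -> serve(Seq, Seq) end).

serve(Full, []) -> serve(Full, Full);       % wrap around
serve(Full, [H | T]) ->
    receive
        {next, From} ->
            From ! {value, H},
            serve(Full, T)
    end.

%% Fetch the next "random" number from the server.
next(Pid) ->
    Pid ! {next, self()},
    receive {value, V} -> V end.
```

Every run then sees exactly the same sequence of "random" choices, making repeated timings directly comparable.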

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function; they omit some overhead for argument processing at the start of execution.
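timer:tc is a standard OTP function returning {Microseconds, Result} for a measured call; a minimal usage sketch in the style described (the module name and measured workload are invented):

```erlang
%% Minimal sketch of self-timing with the standard timer:tc/1 function.
-module(tc_demo).
-export([measure/0]).

measure() ->
    %% timer:tc/1 runs the fun and returns {ElapsedMicroseconds, Result}.
    {MicroSecs, Result} = timer:tc(fun() -> lists:sum(lists:seq(1, 1000)) end),
    {MicroSecs / 1.0e6, Result}.   % convert to seconds, keep the result
```

Timing inside the program this way excludes VM start-up and argument processing, which is why the reported times omit that overhead.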

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23, we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes. As explained earlier, in SR-ACO we use SD Erlang's s_groups to

Figure 21: R15B execution times (TL-ACO, ML-ACO, GR-ACO), Athos cluster

Figure 22: OTP 17.4 execution times (TL-ACO, ML-ACO, GR-ACO), Athos cluster

Figure 23: OTP 17.4 (RELEASE version) execution times (TL-ACO, ML-ACO, GR-ACO, SR-ACO), Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,

Figure 24: TL-ACO execution times (R15B, OTP 17.4 official, OTP 17.4 RELEASE), Athos cluster

Figure 25: ML-ACO execution times (R15B, OTP 17.4 official, OTP 17.4 RELEASE), Athos cluster

Figure 26: GR-ACO execution times (R15B, OTP 17.4 official, OTP 17.4 RELEASE), Athos cluster

Figure 27: R15B execution times, messages × 500

Figure 28: OTP 17.4 execution times, messages × 500

Figure 29: OTP 17.4 (RELEASE version) execution times, messages × 500

Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,

Figure 31: OTP 17.4 execution times (2), Athos cluster

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/sub-master nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO: (a) Number of Sent Packets; (b) Number of Received Packets


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours, or nearly three days.

Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: that is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
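The number of online schedulers tuned here can be inspected and changed at runtime with standard BIFs (it can also be fixed at VM startup, e.g. with erl +S 24:12); a small sketch with an invented module name:

```erlang
%% Sketch of inspecting and tuning the number of online schedulers at
%% runtime, the parameter varied in Figure 36.
-module(sched_demo).
-export([halve_schedulers/0]).

halve_schedulers() ->
    %% Total schedulers configured for this VM.
    Total = erlang:system_info(schedulers),
    %% Set the online count to half (at least 1); returns the old value.
    Old = erlang:system_flag(schedulers_online, max(1, Total div 2)),
    {Total, Old, erlang:system_info(schedulers_online)}.
```

On a hyperthreaded host started with the default settings, halving the online schedulers in this way corresponds to using only the physical cores, the configuration that performed best in these measurements.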

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows. To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we expected that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;

ICT-287510 (RELEASE) 23rd December 2015 46

(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), one that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.
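The waiting pattern just described can be illustrated by a deliberately simplified sketch (this is not Sim-Diasca's actual implementation): a time manager blocks until every local model instance scheduled at the current diasca has reported that its behaviour evaluation is over.

```erlang
%% Simplified illustration of per-diasca synchronisation: the time
%% manager cannot advance the simulation until all scheduled local
%% actors have reported back.
-module(diasca_sync).
-export([wait_for_actors/1]).

%% Pids: local model instances scheduled at this diasca; each one is
%% expected to send {done, self()} to the time manager when finished.
wait_for_actors([]) ->
    ok;  % all reports received, this diasca can be ended
wait_for_actors(Pids) ->
    receive
        {done, Pid} ->
            wait_for_actors(lists:delete(Pid, Pids))
    end.
```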

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency: Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing: Our load testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
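In a simulation case, the resulting settings could then look roughly as follows. This is a hypothetical fragment: the two option names come from the integration work above (rendered here with underscores), but the exact shape of Sim-Diasca's configuration terms, and all node and cookie names, are assumptions for illustration only:

```erlang
%% Hypothetical configuration fragment for running Sim-Diasca under
%% WombatOAM: the computing nodes are assumed to be already running,
%% and the user node reuses the cookie shared by all of them.
SimulationSettings = [
    %% Do not deploy these nodes; assume they were already started
    %% (by WombatOAM) before the user node came up.
    {start_nodes, ['computing_node_1@10.0.0.1',
                   'computing_node_2@10.0.0.2']},

    %% Inhibit random cookie generation on the user node; use the
    %% cookie that all computing nodes were started with.
    {use_cookies, 'shared_simdiasca_cookie'}
].
```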

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca Soda Benchmarking Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and will generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
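Using SD Erlang's s_group API, this design could be sketched as below. The group names and node names are placeholders, and the three-level layout simply mirrors Figure 45; this is an outline, not an implemented design:

```erlang
%% Sketch of the proposed s_group hierarchy for time managers: each
%% non-root manager (here identified by the node hosting it) is in
%% the s_group of its parent and siblings, and also in an s_group
%% containing its own children.
-module(tm_sgroups).
-export([create_hierarchy/0]).

create_hierarchy() ->
    %% s_group of the root time manager and its child managers.
    s_group:new_s_group(root_tm_group,
                        ['root_tm@host0', 'tm1@host1', 'tm2@host2']),
    %% Each child manager additionally owns an s_group grouping it
    %% with its own children, so it belongs to exactly two s_groups.
    s_group:new_s_group(tm1_group,
                        ['tm1@host1', 'tm3@host3', 'tm4@host4']),
    s_group:new_s_group(tm2_group,
                        ['tm2@host2', 'tm5@host5', 'tm6@host6']),
    ok.
```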


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared the ground for removing the next bottlenecks to be encountered, and have identified design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would deadlock under many circumstances otherwise. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port can not be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the number of nodes in total, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
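Taken together, a node in an MPI-launched job might initialise itself as in the following sketch. The helper functions are those listed above; their exact return values are assumptions:

```erlang
%% Sketch (code fragment): bring up distributed Erlang over the
%% MPI back-end and inspect the resulting node set, using the
%% mpihelper functions described above.
start_and_report() ->
    mpihelper:startup(),                 % names this node "mpinode<Index>@<host>"
    Peers = mpihelper:nodes(),           % all other Erlang nodes in the job
    Size  = mpihelper:get_world_size(),  % total number of nodes
    Index = mpihelper:get_index(),       % this node's unique MPI index
    io:format("node ~p of ~p, peers: ~p~n", [Index, Size, Peers]).
```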

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map the Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions (execution time in seconds against number of ants, 1, 10, 20, 30, ..., 1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE))

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions (execution time in seconds against number of ants, 1, 500, 1000, 1500, ..., 100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE))

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set (execution time in seconds against number of ants, 1, 10, 20, 30, ..., 1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE))


[Graph: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Graph: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Graph: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Graph: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Graph: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611-620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45-62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73-74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762-774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371-379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207-221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287-296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305-320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181-5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346-354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986-996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


(a) Runtime

(b) Speedup

Figure 15: D-Orbit and SD-Orbit Performance on the Kalkyl Cluster


3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4: Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO. Suppose we have an SMTWTP instance of size N (i.e., we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N x N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e., has lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started, the modifications to P serving to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang. We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
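The generation loop just described can be sketched sequentially as follows (a Python sketch rather than the actual Erlang code, so the ant processes become a list comprehension; the evaporation and reinforcement constants are illustrative assumptions, not the values used in SMP-ACO):

```python
import random

def run_colony(jobs, n_ants, n_generations, seed=0):
    # jobs: list of (length, due_date, weight) tuples; one SMTWTP instance.
    rng = random.Random(seed)
    n = len(jobs)
    # P[i][j]: desirability of scheduling job i in position j.
    pheromone = [[1.0] * n for _ in range(n)]

    def cost(schedule):
        # Total weighted tardiness: sum of w * max(0, completion - due).
        t, total = 0, 0.0
        for i in schedule:
            length, due, weight = jobs[i]
            t += length
            total += weight * max(0, t - due)
        return total

    def construct_schedule():
        # One "ant": fill each position with a job chosen randomly,
        # weighted by the pheromone entries for that position.
        schedule, remaining = [], list(range(n))
        for pos in range(n):
            weights = [pheromone[i][pos] for i in remaining]
            job = rng.choices(remaining, weights=weights)[0]
            schedule.append(job)
            remaining.remove(job)
        return schedule

    best_schedule, best_cost = None, float("inf")
    for _ in range(n_generations):
        solutions = [construct_schedule() for _ in range(n_ants)]
        gen_best = min(solutions, key=cost)
        if cost(gen_best) < best_cost:
            best_schedule, best_cost = gen_best, cost(gen_best)
        # Decrease all entries slightly, then increase those of the best
        # solution, guiding later generations towards profitable choices.
        for row in pheromone:
            for j in range(n):
                row[j] *= 0.9
        for pos, i in enumerate(best_schedule):
            pheromone[i][pos] += 1.0
    return best_schedule, best_cost
```

In the Erlang version, `construct_schedule` runs once per ant process and the pheromone update is performed only by the master process.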

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one possibility is that one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e., generations of ants) and then report


[Diagram: a master process connected to NC colony processes, one per node, each colony spawning NA ant processes on its local node.]
Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO): there is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes; in the next step, each colony process spawns NA ant processes on the local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO): in TL-ACO, the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X in the following


[Diagram: the master process at level 0; sub-master nodes at levels 1 to N-1; level N contains only colony nodes. Key: process, node, group of nodes.]

Figure 17: Node Placement in Multi-Level Distributed ACO


Figure 18: Process Placement in Multi-Level ACO


inequality: 1 + P + P^2 + P^3 + ... + P^(X-2) + P^X <= N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e., 1 + 5 + 5^3 <= 150) and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO): this adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO): this also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
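The node-budget relation used to size the ML-ACO tree can be checked with a short sketch (Python; `tree_levels` is a hypothetical helper, not part of the ACO code):

```python
def tree_levels(p, n):
    # Largest x such that 1 + p + p^2 + ... + p^(x-2) + p^x <= n.
    # The jump from p^(x-2) to p^x reflects the last level, which holds
    # only colony nodes (p^x of them) rather than sub-master nodes.
    def nodes_used(x):
        return 1 + sum(p ** k for k in range(1, x - 1)) + p ** x

    x = 1
    while nodes_used(x + 1) <= n:
        x += 1
    return x, nodes_used(x)
```

For P = 5 and N = 150 this yields 3 levels using 131 of the 150 nodes, matching the worked example above.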

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Graph: mean error (%) against number of colonies (1 to 256).]

Figure 19: Mean Error

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, then run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.


[Graph: mean execution time (s) against number of colonies (1 to 256).]

Figure 20: Execution time

removed non-determinacy by replacing the random number generator with a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2-3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
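The replacement of the random number generator by a cyclic sequence can be pictured as follows (a Python sketch; the actual sequence and interface used in the Erlang code are not specified here):

```python
from itertools import cycle

class CyclicRandom:
    """Deterministic stand-in for a random source: returns a fixed
    sequence of values, repeating it cyclically, so that every run of
    the benchmark makes exactly the same "random" choices."""
    def __init__(self, seq):
        self._it = cycle(seq)

    def next_value(self):
        # Return the next value in the cycle.
        return next(self._it)
```

Because every run draws the same sequence, differences in execution time between runs reflect the system (scheduling, communication) rather than the search taking a different path.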

We ran each experiment with Erlang versions R15B, OTP 17.4 and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21-23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems; TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


[Graph: execution time (s) against number of nodes (1 to 250), for TL-ACO, ML-ACO and GR-ACO.]

Figure 21: R15B execution times, Athos cluster

[Graph: execution time (s) against number of nodes (1 to 250), for TL-ACO, ML-ACO and GR-ACO.]

Figure 22: OTP 17.4 execution times, Athos cluster


[Graph: execution time (s) against number of nodes (1 to 250), for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24-26 show how the performance of each ACO version varies depending

on the Erlang version; as with the results for the Orbit benchmark (see Section 3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from Section 3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see Section 3.3.4 below.

The results for the various ACO versions are shown in Figures 27-29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21-23 in Section 3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


[Graph: execution time (s) against number of nodes (1 to 250), for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 24: TL-ACO execution times, Athos cluster

[Graph: execution time (s) against number of nodes (1 to 250), for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 25: ML-ACO execution times, Athos cluster


[Graph: execution time (s) against number of nodes (1 to 250), for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 26: GR-ACO execution times, Athos cluster

[Graph: execution time (s) against number of nodes (1 to 250), for TL-ACO, ML-ACO and GR-ACO.]

Figure 27: R15B execution times, messages x 500


[Graph: execution time (s) against number of nodes (1 to 250), for TL-ACO, ML-ACO and GR-ACO.]

Figure 28: OTP 17.4 execution times, messages x 500

[Graph: execution time (s) against number of nodes (1 to 250), for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500


[Graph: execution time (s) against number of nodes (1 to 250), for TL-ACO, ML-ACO and GR-ACO.]

Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon that has caused us some difficulty. The results are shown in Figures 30-32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, so that at certain points including a new machine would place it in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster. [Plot of execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. [Plot of execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
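The degree of fragmentation can be quantified directly from the bracketed part of such a SLURM nodelist expression. The sketch below is a hypothetical Erlang helper (not part of our benchmark harness): each comma-separated element of the expression is one contiguous block of nodes.

```erlang
%% Hypothetical helper: measure fragmentation of a SLURM nodelist
%% expression such as "141,144,181-184" (the bracketed part only).
-module(slurm_frag).
-export([fragments/1, node_count/1]).

%% Each comma-separated element is one contiguous block.
fragments(Alloc) ->
    length(string:tokens(Alloc, ",")).

%% A range "A-B" contributes B-A+1 nodes; a single entry contributes 1.
node_count(Alloc) ->
    lists:sum([block_size(B) || B <- string:tokens(Alloc, ",")]).

block_size(Block) ->
    case string:tokens(Block, "-") of
        [_Single] -> 1;
        [Lo, Hi]  -> list_to_integer(Hi) - list_to_integer(Lo) + 1
    end.
```

For instance, `slurm_frag:fragments("141,144,181-184")` returns 3. Applied to the two allocations above, the first parses into roughly twice as many contiguous blocks as the second, making the spread across the cluster explicit.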

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP 17.4 than in Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16 core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
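For reference, both settings can be combined on the erl command line; a minimal sketch, where the 12:12 value corresponds to the 12 physical cores of an Athos host as discussed above:

```
# Bind schedulers using the thread_no_node_processor_spread policy and
# start 12 schedulers, all online (+S Schedulers:SchedulersOnline).
erl +sbt tnnps +S 12:12
```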

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 logical available cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows. To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, when the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
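The core of such a plugin can be sketched as follows. All module, callback, and option names here are hypothetical, and we assume Percept2 exposes percept2:profile/2 and percept2:stop_profile/0 entry points; the actual plugin follows the Sim-Diasca plugin API and Percept2's real option set.

```erlang
%% Sketch of a Sim-Diasca plugin that starts/stops Percept2 remotely
%% (hypothetical names; not the actual plugin code).
-module(percept2_plugin_sketch).
-export([simulation_started/1, simulation_stopped/1]).

%% Called when the simulation starts: begin profiling on every computing
%% node, writing one trace file per node.
simulation_started(ComputingNodes) ->
    [rpc:call(Node, percept2, profile,
              [atom_to_list(Node) ++ ".percept2", [procs]])
     || Node <- ComputingNodes],
    ok.

%% Called when the simulation ends: stop profiling everywhere.
simulation_stopped(ComputingNodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.
```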

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes, and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of the node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
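The naming rule described above can be sketched as a small Erlang helper. This is a hypothetical illustration only; WombatOAM's actual implementation may differ.

```erlang
%% Hypothetical sketch of the node-naming rule: capitalise each word of
%% the simulation name, join with underscores, and append user and host.
%% E.g. node_name("soda_benchmarking_test", "myuser", "10.0.0.1") yields
%% "Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1".
-module(sd_node_name_sketch).
-export([node_name/3]).

node_name(Simulation, User, Host) ->
    Camelised = string:join([capitalise(W)
                             || W <- string:tokens(Simulation, "_")], "_"),
    "Sim-Diasca_" ++ Camelised ++ "-" ++ User ++ "@" ++ Host.

capitalise([First | Rest]) ->
    [string:to_upper(First) | Rest].
```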

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
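Under this design, the grouping could be expressed with SD Erlang's s_group:new_s_group/2 call. The sketch below is purely illustrative (hypothetical module name and tree representation): it walks the time-manager tree of Figure 44 and creates one s_group per non-leaf time manager, containing that manager's node and its children's nodes, so that each manager ends up in its parent's group and in its own children's group.

```erlang
%% Hypothetical sketch: create one s_group per time-manager parent.
%% A tree node is represented as {TimeManagerNode, ListOfChildTrees}.
-module(tm_sgroup_sketch).
-export([create_groups/1]).

create_groups({_Node, []}) ->
    ok;  % a leaf time manager forms no s_group of its own
create_groups({Node, Children}) ->
    ChildNodes = [N || {N, _} <- Children],
    GroupName  = list_to_atom("tm_group_" ++ atom_to_list(Node)),
    %% Assumes SD Erlang's s_group:new_s_group(Name, Nodes) API.
    {ok, GroupName, _Nodes} =
        s_group:new_s_group(GroupName, [Node | ChildNodes]),
    lists:foreach(fun create_groups/1, Children).
```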

ICT-287510 (RELEASE) 23rd December 2015 54

Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to follow regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user access the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpi_helper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
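Assuming the helpers behave as described above, their use on each compute node of an MPI job might look as follows (the return values and the printed format are illustrative, not taken from the actual module):

```erlang
%% Illustrative start-up sequence on every Erlang node of an MPI job.
start() ->
    %% Builds the node name basename ++ MPI index ++ hostname, passes it
    %% to net_kernel, and connects every pair of nodes in the job.
    mpi_helper:startup(mpinode),
    Size   = mpi_helper:get_world_size(),  %% total number of Erlang nodes
    Index  = mpi_helper:get_index(),       %% this node's unique MPI index
    Others = mpi_helper:nodes(),           %% every other node in the job
    io:format("node ~p of ~p, peers: ~w~n", [Index, Size, Others]).
```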

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on this first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene/Q to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
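For concreteness, the two sequences of ant counts can be generated as follows (a sketch; the original benchmarking harness is not shown in this document):

```erlang
%% Ant counts used in the experiments:
%% small runs 1, 10, 20, ..., 1000; large runs 1, 500, 1000, ..., 100000.
small_counts() -> [1 | lists:seq(10, 1000, 10)].
large_counts() -> [1 | lists:seq(500, 100000, 500)].
```

Each run of the benchmark is then repeated for every count in the chosen sequence, giving 101 small and 201 large data points per VM release.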

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set

ICT-287510 (RELEASE) 23rd December 2015 61

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



3.2 Ant Colony Optimisation (ACO)

In this section we discuss the scalability of the Ant Colony Optimisation (ACO) benchmark. For a detailed description of ACO, refer to deliverable D3.4, Scalable Reliable OTP Library Release [REL14a]. The code for the different versions of ACO that we discuss in this section is open source and can be found at https://github.com/release-project/benchmarks/tree/master/ACO.

3.2.1 ACO and SMTWTP

Ant Colony Optimisation [DS04] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. In the RELEASE project we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [McN59], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

Single-colony ACO Suppose we have an SMTWTP instance of size N (i.e. we have N jobs to schedule). In the basic ACO strategy, we have a colony containing a number of ants, which independently construct solutions to the input problem. The ants do this by using heuristic methods with occasional random perturbations. The search is guided by an N × N matrix P, called the pheromone matrix, whose (i, j)-th entry is a real number which indicates the desirability of scheduling job i in position j. When all of the ants have finished, their solutions are compared to determine which is the best (i.e. has the lowest cost). The elements of P corresponding to this solution are then increased, while other elements are decreased; after this, a new generation of ants is started, the modifications to P serving to guide the new ants towards choices which have proved profitable in the past. The entire process terminates when some suitable condition is met: for example, a specified number of generations may have passed, or the current best solution may have failed to improve for a given number of generations.

Single-colony ACO in Erlang We have implemented a single-colony ACO application (SMP-ACO) which runs on a single Erlang node. Our implementation is based on [BBHS99, dBSD00, MM00], which give sequential ACO algorithms for solving the SMTWTP; we have exploited Erlang's concurrency to obtain a parallel version. Each ant is implemented as an Erlang process, and there is a single master process which collects the results from the ants and compares them to find the best one; once it has done this, it uses the best solution to update the pheromone matrix, and then starts a new generation of ants. The pheromone matrix P is implemented as an ETS table with one entry for each row, the rows being represented by N-tuples of floats. All of the ant processes read P, but only the master process writes to it. The colony runs for a fixed number of generations, which is supplied as a parameter (together with the number of ants).
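The scheme just described can be sketched as follows. This is a simplified, self-contained illustration with a dummy cost function and a naive pheromone rule, not the actual SMP-ACO source; only the process structure (one process per ant, a single master writing to the ETS table) follows the text:

```erlang
-module(smp_aco_sketch).
-export([run/3]).

%% run(Gens, NumAnts, N): run Gens generations with NumAnts ant processes
%% on a problem of size N; returns the best (lowest-cost) solution found.
run(Gens, NumAnts, N) ->
    Tab = ets:new(pheromone, [set, public, {read_concurrency, true}]),
    %% One row of N floats per job, initially uniform.
    [ets:insert(Tab, {Job, erlang:make_tuple(N, 1.0)}) || Job <- lists:seq(1, N)],
    master_loop(Gens, {infinity, none}, NumAnts, Tab, N).

master_loop(0, Best, _NumAnts, _Tab, _N) -> Best;
master_loop(Gens, Best, NumAnts, Tab, N) ->
    Self = self(),
    %% Spawn one process per ant; each constructs a solution guided by P.
    [spawn_link(fun() -> Self ! {solution, construct_solution(Tab, N)} end)
     || _ <- lists:seq(1, NumAnts)],
    Solutions = [receive {solution, S} -> S end || _ <- lists:seq(1, NumAnts)],
    NewBest = lists:min([Best | Solutions]),   %% {Cost, Schedule}: lowest cost wins
    update_pheromones(Tab, NewBest),           %% only the master writes to P
    master_loop(Gens - 1, NewBest, NumAnts, Tab, N).

%% Placeholder "heuristic": a random schedule with a dummy cost function.
construct_solution(_Tab, N) ->
    Schedule = [J || {_, J} <- lists:sort([{rand:uniform(), J} || J <- lists:seq(1, N)])],
    {dummy_cost(Schedule), Schedule}.

dummy_cost(Schedule) ->
    lists:sum([abs(Pos - Job)
               || {Pos, Job} <- lists:zip(lists:seq(1, length(Schedule)), Schedule)]).

%% Reinforce the entries of P along the best schedule (decay of the other
%% entries is elided in this sketch).
update_pheromones(Tab, {_Cost, Schedule}) ->
    lists:foreach(
      fun({Pos, Job}) ->
              [{_, Row}] = ets:lookup(Tab, Job),
              ets:insert(Tab, {Job, setelement(Pos, Row, element(Pos, Row) + 0.1)})
      end,
      lists:zip(lists:seq(1, length(Schedule)), Schedule)).
```

The public, read-concurrent ETS table matches the access pattern in the text: many concurrent readers (the ants), a single writer (the master).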

3.2.2 Multi-colony approaches

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution, but there are other potential benefits as well. For example, different colonies can follow different strategies: one might choose to allow more randomness in certain colonies, thus increasing the chances of escaping from a solution which is locally optimal but not globally so. We can also vary the topology of a network of colonies, allowing us to explore how different ways of sharing information affect the quality of the solutions obtained.

We have implemented four separate multi-colony ACO applications in Erlang. In each of these, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report


[Diagram: a master process connected to NC colony nodes, each of which runs ant processes 1 to NA.]

Figure 16: Two-Level Distributed ACO

their best solutions; the globally-best solution is then selected and reported back to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO). There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes. In the next step, each colony process spawns NA ant processes on its local node. In the figure, the objects and their corresponding captions have the same colour. As the arrows show, communications between the master process and the colonies are bidirectional: there are IM communications between the master process and a colony process, and IA bidirectional communications between a colony process and an ant process.

• Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node at the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N-1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N-1 (one level prior to the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the number of processes, nodes, and levels. If the number of processes on each node is P and the number of all available nodes is N, then the number of levels X is the maximum X satisfying the following inequality:


[Figure: a master process at Level 0, sub-master nodes at Levels 1 to N-1, and colony nodes at Level N; the legend distinguishes processes, nodes, and groups of nodes]

Figure 17: Node Placement in Multi-Level Distributed ACO


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X-2) + P^X <= N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 <= 150), and only 131 nodes out of the 150 can be used.
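This sizing rule is easy to check mechanically. The following short Python sketch (our own illustration, with hypothetical function names) computes the node count 1 + P + ... + P^(X-2) + P^X for a candidate X, and the largest X that fits within N nodes:

```python
def tree_nodes(p, x):
    """Nodes used by an X-level tree: the master and sub-master nodes
    (1 + P + ... + P^(X-2)) plus the P^X colony nodes."""
    return sum(p**i for i in range(x - 1)) + p**x

def max_levels(p, n):
    """Largest number of levels X such that tree_nodes(p, X) <= n,
    together with the number of nodes actually used."""
    x = 1
    while tree_nodes(p, x + 1) <= n:
        x += 1
    return x, tree_nodes(p, x)
```

For the example in the text, `max_levels(5, 150)` yields 3 levels using 131 of the 150 nodes.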

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies, and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Figure: mean error (%) against number of colonies, from 1 to 256]

Figure 19: Mean Error

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, then run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken to find a solution. We see an upward trend, due to the increasing amount of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


[Figure: mean execution time (s), between about 16 and 22 s, against number of colonies from 1 to 256]

Figure 20: Execution time

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2-3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
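A deterministic stand-in of the kind described, which cycles through a fixed sequence instead of drawing fresh random numbers, might look like this (our own Python illustration, not the project's Erlang code):

```python
import itertools

def make_cyclic_rng(seq):
    """Return a zero-argument 'random' function that cycles through seq.

    Replacing the true RNG with such a function makes repeated runs
    deterministic, which aids reproducibility of timing experiments."""
    it = itertools.cycle(seq)
    return lambda: next(it)

rng = make_cyclic_rng([0.1, 0.5, 0.9])
values = [rng() for _ in range(5)]  # [0.1, 0.5, 0.9, 0.1, 0.5]
```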

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21-23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, and GR-ACO]

Figure 21: R15B execution times, Athos cluster

[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, and GR-ACO]

Figure 22: OTP 17.4 execution times, Athos cluster


[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, GR-ACO, and SR-ACO]

Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24-26 show how the performance of each ACO version varies depending on the Erlang version; as with the results for the Orbit benchmark (see Section 3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from Section 3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see Section 3.3.4 below.

The results for the various ACO versions are shown in Figures 27-29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21-23 in Section 3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


[Figure: execution time (s) against number of nodes, 0-250, for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE)]

Figure 24: TL-ACO execution times, Athos cluster

[Figure: execution time (s) against number of nodes, 0-250, for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE)]

Figure 25: ML-ACO execution times, Athos cluster


[Figure: execution time (s) against number of nodes, 0-250, for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE)]

Figure 26: GR-ACO execution times, Athos cluster

[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, and GR-ACO]

Figure 27: R15B execution times, messages x 500


[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, and GR-ACO]

Figure 28: OTP 17.4 execution times, messages x 500

[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, GR-ACO, and SR-ACO]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500


[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, and GR-ACO]

Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30-32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in Section 3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of Section 3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, and GR-ACO]

Figure 31: OTP 17.4 execution times (2), Athos cluster

[Figure: execution time (s) against number of nodes, 0-250, for TL-ACO, ML-ACO, GR-ACO, and SR-ACO]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO, and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption, we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
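For reference, the figures quoted here follow the usual definitions of relative speedup, S(n) = T(1)/T(n), and parallel efficiency, E(n) = S(n)/n. A small Python sketch using the approximate runtimes from the text (the 290-minute figure for 16 nodes is our reading of "falls below 300 minutes"):

```python
def speedup(t1, tn):
    """Relative speedup: single-node runtime over n-node runtime."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Fraction of ideal linear speedup achieved on n nodes."""
    return speedup(t1, tn) / n

# Approximate figures from the text: ~1000 min on 1 node (16 cores),
# ~290 min on 16 nodes (256 cores).
s16 = speedup(1000, 290)         # about 3.45
e16 = efficiency(1000, 290, 16)  # about 0.22, i.e. 22% of linear scaling
```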

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that the default settings would select.
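Both settings are controlled by standard erl command-line flags; a sketch of the kind of invocation implied here (the node name and cookie are illustrative, not taken from the case study):

```shell
# Bind schedulers using the thread_no_node_processor_spread policy and
# start 12 schedulers (12 online), ignoring the hyperthreaded logical cores.
erl +sbt tnnps +S 12:12 -name computing_node@host1 -setcookie sim_diasca
```

Here +S Schedulers:SchedulersOnline fixes the scheduler count that the default settings would otherwise derive from the number of logical cores.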

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 logical cores available) and 14% (8.96 GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at this scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale, up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps needed to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca, we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct version of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
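As an illustration, the two entries might appear together as follows. This is only a hedged sketch: the key names start_nodes and use_cookies come from the text above, but the concrete syntax and values are assumptions, not Sim-Diasca's documented configuration format.

```erlang
%% Hypothetical configuration sketch (syntax and values are illustrative):

%% assume the computing nodes passed as a parameter are already running,
%% so the deployment manager must not deploy them itself
{start_nodes, false}.

%% use this fixed cookie (the one shared by all computing nodes) instead
%% of a randomly generated one
{use_cookies, "wombat_cookie"}.
```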

5.1.2 Deployment steps

Prerequisites Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of naming a node: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,          % Node family of the computing nodes
>     soda_benchmarking_test). % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).
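The node-name computation described above can be sketched as follows. This is a reconstruction from the naming example given earlier (simulation soda_benchmarking_test, user myuser, host 10.0.0.1); the module and function names are hypothetical, and this is not Sim-Diasca's actual code.

```erlang
%% Hypothetical sketch: build the node name expected by Sim-Diasca from the
%% simulation name, the user name and the host (a reconstruction only).
-module(node_naming).
-export([computing_node_name/3]).

computing_node_name(Simulation, User, Host) ->
    %% "soda_benchmarking_test" -> "Soda_Benchmarking_Test"
    Camel = string:join([capitalize(W) || W <- string:tokens(Simulation, "_")],
                        "_"),
    list_to_atom("Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host).

%% Upper-case the first character of a word.
capitalize([First | Rest]) ->
    [string:to_upper(First) | Rest].
```

Under this reconstruction, computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1") would yield 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.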

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
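A pseudocode-level sketch of this grouping, assuming the SD Erlang s_group API (s_group:new_s_group/2) and purely illustrative group names, might look like:

```erlang
%% Sketch only: create one s_group per parent time manager, containing the
%% parent and its children. The s_group API usage and the return shapes
%% below are assumptions about SD Erlang, not verified code.
-module(tm_grouping).
-export([create_time_manager_groups/2]).

%% Children :: [{ChildTimeManagerNode, [GrandChildNode]}]
create_time_manager_groups(RootNode, Children) ->
    ChildNodes = [C || {C, _} <- Children],
    %% The root time manager and its direct children share one s_group.
    {ok, _, _} = s_group:new_s_group(root_tm_group, [RootNode | ChildNodes]),
    %% Each child time manager also forms an s_group with its own children,
    %% so a non-leaf time manager belongs to exactly two s_groups.
    [begin
         Name = list_to_atom("tm_group_" ++ atom_to_list(Child)),
         {ok, _, _} = s_group:new_s_group(Name, [Child | GrandChildren])
     end || {Child, GrandChildren} <- Children],
    ok.
```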


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling from


Erlang net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, one instead needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
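Putting these pieces together, a session on each node might look like the sketch below. The function names come from the description above, but the startup return value and the exact call sequence are assumptions, not verified against the actual module.

```erlang
%% Sketch of a session on each Erlang node, assuming it was started with
%%   erl -no_epmd -connect_all false -proto_dist mpi
%% The return value of startup/0 is not documented above, so it is ignored.
-module(mpi_demo).
-export([start_and_report/0]).

start_and_report() ->
    mpihelper:startup(),                 % initialise distribution over MPI
    Index = mpihelper:get_index(),       % this node's unique MPI index
    Size  = mpihelper:get_world_size(),  % total number of Erlang nodes
    Peers = mpihelper:nodes(),           % all other, now connected, nodes
    io:format("node ~p of ~p, peers: ~p~n", [Index, Size, Peers]).
```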

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map the Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: mean execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
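The averaging step can be sketched as the following hypothetical post-processing helper (not part of the benchmark harness itself; the module and function names are illustrative):

```erlang
%% Hypothetical sketch: average the repeated measurements for each ant
%% count, mirroring how the mean execution times in the graphs were
%% obtained (5 runs per ant count).
-module(aco_postprocess).
-export([average_runs/1]).

%% Runs :: [{NumberOfAnts, ExecutionTimeInSeconds}]
average_runs(Runs) ->
    AntCounts = lists:usort([A || {A, _} <- Runs]),
    [{A, average([T || {A2, T} <- Runs, A2 =:= A])} || A <- AntCounts].

average(Times) ->
    lists:sum(Times) / length(Times).
```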



Figure 47: EDF Xeon machines, large executions


Figure 48: EDF Xeon machines, small executions, with the erts +Muacul0 flag set



Figure 49: EDF Xeon machines, large executions, with the erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a



Figure 50: Glasgow Xeon machines, small executions


Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Figure 53: Heriot-Watt AMD machine, small executions. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


[Figure 54: Heriot-Watt AMD machine, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



[Figure 16: Two-Level Distributed ACO. A master process communicates with NC colony nodes; each colony process spawns NA ant processes on its local node.]

their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations. Our four versions are as follows:

• Two-level ACO (TL-ACO): There is a single master node which collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 16 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step, each colony process spawns NA ant processes on the local node. In the figure, the objects and their corresponding captions have the same color. As the arrows show, communications between the master process and colonies are bidirectional. There are IM communications between the master process and a colony process; also, IA bidirectional communications are done between a colony process and an ant process.

• Multi-level ACO (ML-ACO): In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 17), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from among a number of their children.

Figure 18 shows the process placement in the implemented ML-ACO. If there are P processes on every sub-master node, then the number of processes on level N is P^N and the number of nodes is P^(N−1). A process on level L creates and monitors P processes on a node at level L+1. However, the last level is an exception, because it consists only of colony nodes, and every colony node has one colony process. A process on level N−1 (one level before the last) is responsible for P nodes on level N, and consequently the number of nodes on level N is P^N.

To create a multi-level tree of sub-master nodes, we need to find a relation between the numbers of processes, nodes and levels. If the number of processes on each node is P, and the number of all available nodes is N, then the number of levels X is the maximum X satisfying the following inequality:


[Figure 17: Node Placement in Multi-Level Distributed ACO. The master process sits at level 0, sub-master nodes occupy levels 1 to N−1, and level N contains only colony nodes.]


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ⋯ + P^(X−2) + P^X ≤ N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 ≤ 150), and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO): This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO): This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO, nodes are only connected to the nodes in their own s_group.
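The sub-master tree sizing rule given for ML-ACO above can be checked mechanically. The following Python sketch is our own illustration (the actual ACO code is Erlang): it computes the deepest tree, and the number of usable nodes, for a given P and N.

```python
def usable_nodes(p, x):
    """Nodes needed for an x-level ML-ACO tree: the master and the
    sub-master levels occupy 1 + P + ... + P^(X-2) nodes, and the
    last level holds P^X colony nodes."""
    return sum(p ** level for level in range(x - 1)) + p ** x

def max_levels(p, n):
    """Deepest tree whose total node count still fits in n nodes."""
    x = 1
    while usable_nodes(p, x + 1) <= n:
        x += 1
    return x

# With P = 5 and N = 150 the tree has 3 levels and uses
# 1 + 5 + 5**3 = 131 of the 150 available nodes, as in the text.
levels = max_levels(5, 150)
print(levels, usable_nodes(5, levels))  # prints: 3 131
```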

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13], for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Figure 19: Mean Error. Mean error (%) against number of colonies (1 to 256).]

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions; it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

⁹ The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


[Figure 20: Execution time. Mean execution time (s) against number of colonies (1 to 256).]

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
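The deterministic replacement for the random number generator can be sketched as follows. This is an illustrative Python rendering of the idea only (the actual implementation is Erlang, and the names here are our own):

```python
import itertools

def make_cyclic_rand(values):
    """Return a zero-argument 'random' function that cycles through a
    fixed sequence, so repeated runs are exactly reproducible."""
    it = itertools.cycle(values)
    return lambda: next(it)

# Hypothetical fixed sequence standing in for real random draws.
rand = make_cyclic_rand([0.12, 0.57, 0.93, 0.31])
samples = [rand() for _ in range(6)]  # wraps around after 4 values
# samples == [0.12, 0.57, 0.93, 0.31, 0.12, 0.57]
```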

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
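Erlang's timer:tc returns the elapsed wall-clock time in microseconds together with the function's result; a rough Python analogue of this measurement wrapper, for readers unfamiliar with it, is:

```python
import time

def tc(fun, *args):
    """Like Erlang's timer:tc: return (elapsed_microseconds, result)."""
    start = time.perf_counter()
    result = fun(*args)
    elapsed_us = int((time.perf_counter() - start) * 1_000_000)
    return elapsed_us, result

elapsed, value = tc(sum, range(1000))  # value == 499500
```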

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23, we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


[Figure 21: R15B execution times, Athos cluster. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

[Figure 22: OTP 17.4 execution times, Athos cluster. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]


[Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version; as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than the R15B VM does.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


[Figure 24: TL-ACO execution times, Athos cluster. Execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 25: ML-ACO execution times, Athos cluster. Execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


[Figure 26: GR-ACO execution times, Athos cluster. Execution time (s) against number of nodes (0–250) for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 27: R15B execution times, messages ×500. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]


[Figure 28: OTP 17.4 execution times, messages ×500. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

[Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]


[Figure 30: R15B execution times (2), Athos cluster. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

which illustrate a phenomenon that has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here:

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141144181-184189-198235-286289-306325-347353-360363-366378387-396467-468541-549577-592595-598602611-648665667-684701-726729734-735771-776]

whereas the allocation for Figure 23 was

atcn[055-072109-144199-216235-252271-306325-342433-450458465-467505-522541-594667-684]
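These allocation strings use SLURM's compressed hostlist syntax. A minimal expansion sketch (our own; it handles only the simple prefix[ranges] form shown above, not nested brackets) makes it easy to quantify how scattered an allocation is:

```python
import re

def expand_hostlist(spec):
    """Expand e.g. 'atcn[055-057,060]' into a list of host names."""
    m = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", spec)
    if not m:
        return [spec]  # plain host name, nothing to expand
    prefix, body = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero padding, e.g. 055
            hosts.extend(f"{prefix}{i:0{width}d}"
                         for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + part)
    return hosts

hosts = expand_hostlist("atcn[055-057,060]")
# hosts == ['atcn055', 'atcn056', 'atcn057', 'atcn060']
```

Counting the contiguous runs in each expanded allocation gives a simple fragmentation measure for comparing the two allocations above.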

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster; this means that at certain points, including a new machine would place it in a more “distant” region of the cluster in terms of communication,


[Figure 31: OTP 17.4 execution times (2), Athos cluster. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

[Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]


and so it would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.
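Packet counts of this kind can be gathered on Linux by sampling the kernel's interface counters (e.g. /proc/net/dev) before and after a run. The parsing sketch below is our own illustration, with a made-up sample; the deliverable itself used netstat-style tooling rather than this exact script:

```python
def parse_net_dev(text):
    """Map interface name -> (received_packets, sent_packets) from
    /proc/net/dev text (packet counters are fields 2 and 10)."""
    counts = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        iface, data = line.split(":", 1)
        fields = data.split()
        counts[iface.strip()] = (int(fields[1]), int(fields[9]))
    return counts

# Made-up sample in the /proc/net/dev format, for illustration only.
sample = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000000   9000    0    0    0     0          0         0  2000000   7000    0    0    0     0       0          0
"""
counts = parse_net_dev(sample)  # {'eth0': (9000, 7000)}
```

Sampling twice and subtracting gives the per-run packet counts plotted in Figure 33.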

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better under Erlang/OTP R15B than under Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the “small” scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node; it shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls away at 4 nodes (64 cores), where the speedup is only 2.2, and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
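The speedup and efficiency figures quoted above are computed in the usual way; in this Python sketch, the 290-minute figure is illustrative, read off Figure 34 (around 1000 minutes on one node, just below 300 minutes on 16 nodes):

```python
def relative_speedup(t_base, t_scaled):
    """Speedup of a run relative to the single-node baseline."""
    return t_base / t_scaled

def efficiency(t_base, t_scaled, scale_factor):
    """Fraction of the ideal speedup achieved at the given scale."""
    return relative_speedup(t_base, t_scaled) / scale_factor

# ~1000 minutes on 1 node (16 cores) vs ~290 minutes on 16 nodes:
s = relative_speedup(1000, 290)  # about 3.45, as in Figure 35
e = efficiency(1000, 290, 16)    # about 0.22: poor utilisation
```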

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread binding policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
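On the command line, the two tunings discussed above would be combined roughly as follows. This is a sketch, not a command taken from the deliverable: +sbt tnnps is the binding policy named above, and +S 24:12 uses the standard erl syntax Schedulers:SchedulersOnline to create 24 schedulers but bring only 12 online, one per physical core of an Athos node.

```shell
# Sketch: bind schedulers with the thread_no_node_processor_spread
# policy and run with 12 online schedulers (the physical cores)
# instead of the 24 logical cores exposed by hyperthreading.
erl +sbt tnnps +S 24:12
```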

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 of the 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster, during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study Exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only the physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups at around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-plus node scale before we could usefully apply techniques such as SD Erlang's s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was to move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was unrelated to the actual simulation, stemming, for example, from its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
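The core of such a plugin can be sketched as follows. This is only an illustration: the hook names simulation_started/1 and simulation_stopped/1 stand for the actual Sim-Diasca plugin callbacks, and the exact Percept2 profiling calls and options are assumptions, not the documented API.

```erlang
-module(percept2_plugin_sketch).
-export([simulation_started/1, simulation_stopped/1]).

%% Hypothetical hook, invoked when the simulation starts: begin
%% Percept2 profiling on every computing node via an RPC call,
%% writing one trace file per node.
simulation_started(ComputingNodes) ->
    [rpc:call(Node, percept2, profile,
              ["percept2_" ++ atom_to_list(Node) ++ ".dat",
               [concurrency, message]])
     || Node <- ComputingNodes],
    ok.

%% Hypothetical hook, invoked when the simulation ends: stop
%% profiling everywhere, leaving the per-node files for later
%% analysis and visualisation with Percept2.
simulation_stopped(ComputingNodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.
```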

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca under Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time
(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'

(a) Execution time
(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for approximately 5 seconds of the simulation execution.
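The windowed-profiling variant of the plugin can be sketched as follows; the helper names are illustrative placeholders for the start/stop logic described above, not the actual Sim-Diasca plugin API, while the 10-second and 5-second delays are those stated in the text.

```erlang
%% Sketch: profile only a 5-second slice of the run, starting
%% 10 seconds after the simulation begins. StartFun/StopFun stand
%% for the (hypothetical) helpers that start and stop Percept2 on
%% all computing nodes.
profile_window(StartFun, StopFun) ->
    timer:sleep(10000), % let the simulation setup and warm-up pass
    StartFun(),         % begin Percept2 profiling on all nodes
    timer:sleep(5000),  % capture ~5 s of simulation activity
    StopFun().          % stop profiling, leaving ~85MB of traces
```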

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates: by default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks: these allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their respective licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. Provisioning the virtual machine instances before each simulation and terminating them after each simulation is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
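Assuming an Erlang-term configuration file, the two new entries might look roughly as follows. This is a hypothetical sketch only: the concrete file format, entry syntax, and values belong to Sim-Diasca and are not reproduced from the deliverable.

```erlang
%% Hypothetical sketch of the two new configuration entries.
%% start_nodes: the computing nodes passed in are assumed to be
%% already running, so the deployment manager must not deploy them.
{start_nodes, ['simdiasca_comp1@10.0.0.1',
               'simdiasca_comp2@10.0.0.2']}.
%% use_cookies: use this fixed cookie (the one shared by all
%% computing nodes) instead of generating a random one.
{use_cookies, shared_cluster_cookie}.
```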

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned, but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the node names when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
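Under this design, placing one non-root time manager into its two s_groups could be sketched as follows against the SD Erlang s_group API; the group and node names are invented for illustration, and return values are deliberately not pattern-matched, as this design has not been implemented.

```erlang
%% Sketch: the time manager on node 'tm1@host1' joins two s_groups:
%% one shared with its parent (the root) and its siblings, and one
%% containing its own children. All names below are illustrative.
setup_time_manager_groups() ->
    %% Group of the parent time manager and its direct children.
    _ = s_group:new_s_group(root_and_children,
                            ['root_tm@host0', 'tm1@host1', 'tm2@host2']),
    %% Group of this time manager and its own children; gateway
    %% processes in each group would route messages between groups.
    _ = s_group:new_s_group(children_of_tm1,
                            ['tm1@host1', 'tm1a@host3', 'tm1b@host4']),
    ok.
```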


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang Version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

The improved knowledge of these applications and the scalability issues they experience prepares the ground for removing the next bottlenecks to be encountered, and promotes design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialised for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialised nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be exploited by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialised. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI_index ++ hostname, which is passed to net_kernel for initialisation. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialise the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialised.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to trigger tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement for TCP/IP in the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Figure 46: EDF Xeon machines, small executions. Execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
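For concreteness, the two sequences of ant counts can be generated as follows; this is only an illustrative sketch in Python (the step sizes of 10 and 500 come from the lists above):

```python
# Ant counts for the two experiment sizes (one Erlang process per ant).
small = [1] + list(range(10, 1001, 10))      # 1, 10, 20, 30, ..., 1000
large = [1] + list(range(500, 100001, 500))  # 1, 500, 1000, 1500, ..., 100000

print(len(small), len(large))  # 101 201
```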

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Figure 47: EDF Xeon machines, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

[Figure 48: EDF Xeon machines, small executions, with the erts +Muacul0 flag set. Same axes and Erlang versions as Figure 46.]


[Figure 49: EDF Xeon machines, large executions, with the erts +Muacul0 flag set. Same axes and Erlang versions as Figure 47.]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.
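The percentage slowdowns quoted in this appendix are simple ratios of mean execution times. A minimal sketch of the calculation (the sample times below are invented for illustration):

```python
def slowdown_pct(t_new, t_base):
    """Percentage by which mean time t_new exceeds mean time t_base."""
    return (t_new / t_base - 1.0) * 100.0

# Hypothetical mean execution times (seconds) for one ant count:
r15b, otp174 = 0.40, 0.45
print(round(slowdown_pct(otp174, r15b), 1))  # 12.5
```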

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Figure 50: Glasgow Xeon machines, small executions. Execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for the same five Erlang versions.]

[Figure 51: Glasgow Xeon machines, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for the same five Erlang versions.]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


[Figure 52: Ericsson AMD machine, small executions.]

[Figure 53: Heriot-Watt AMD machine, small executions. Execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]


[Figure 54: Heriot-Watt AMD machine, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for the same five Erlang versions.]

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(3):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



[Figure 17: Node placement in multi-level distributed ACO. The master process is at level 0, sub-master nodes occupy levels 1 to N−1, and level N contains only colony nodes.]

[Figure 18: Process placement in multi-level ACO.]


1 + P + P^2 + P^3 + ... + P^(X-2) + P^X ≤ N. For example, if P = 5 and N = 150 then the tree will have 3 levels (i.e., 1 + 5 + 5^3 ≤ 150), and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.
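The node-count condition for ML-ACO's sub-master tree can be sketched as follows. This is an illustrative Python sketch that assumes the tree shape implied by the worked example above (powers 0 to X−2 of P for the master and sub-master levels, plus P^X colony nodes):

```python
def nodes_used(p, x):
    # Master and sub-master levels (powers 0 .. X-2 of P) plus P^X colony nodes;
    # for P = 5, X = 3 this gives 1 + 5 + 5**3 = 131, matching the example.
    return sum(p**i for i in range(x - 1)) + p**x

def max_levels(p, n):
    """Largest number of levels X whose tree fits into N available nodes."""
    x = 1
    while nodes_used(p, x + 1) <= n:
        x += 1
    return x, nodes_used(p, x)

print(max_levels(5, 150))  # (3, 131): 131 of the 150 nodes can be used
```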

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes: the master, sub-masters, colonies, and ants, regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above) and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Figure 19: Mean error (%) against number of colonies (1 to 256).]

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run your program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]⁹), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
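The quality metric plotted in Figure 19 is simply the mean gap between obtained and optimal costs. A sketch of the calculation (the instance costs below are invented for illustration):

```python
def mean_error_pct(costs, optima):
    """Mean percentage difference between obtained costs and known optima."""
    gaps = [(c - o) / o * 100.0 for c, o in zip(costs, optima)]
    return sum(gaps) / len(gaps)

# Three hypothetical SMTWTP instances with known optimal costs:
print(mean_error_pct([105.0, 210.0, 99.0], [100.0, 200.0, 90.0]))
```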

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO, and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility we

⁹The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


[Figure 20: Mean execution time (s) against number of colonies (1 to 256).]

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
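A minimal sketch of the determinised "random" source described above (the actual replacement was written in Erlang; this Python version only illustrates the idea of a fixed cyclic sequence):

```python
import itertools

def cyclic_source(numbers):
    """Return a zero-argument function that yields the given numbers
    over and over, replacing a true random-number generator."""
    it = itertools.cycle(numbers)
    return lambda: next(it)

nxt = cyclic_source([3, 1, 4, 1, 5])
print([nxt() for _ in range(7)])  # [3, 1, 4, 1, 5, 3, 1]
```

Because every run draws the same sequence, repeated executions make identical algorithmic choices, isolating the communication overheads under study.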

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of sub-masters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


[Figure 21: R15B execution times, Athos cluster. Execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, and GR-ACO.]

[Figure 22: OTP 17.4 execution times, Athos cluster. Execution time (s) against number of nodes for TL-ACO, ML-ACO, and GR-ACO.]


[Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster. Execution time (s) against number of nodes for TL-ACO, ML-ACO, GR-ACO, and SR-ACO.]

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version, and as with the results for the Orbit benchmark (see Section 3.1.4) we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from Section 3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see Section 3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in Section 3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


[Figure 24: TL-ACO execution times, Athos cluster. Execution time (s) against number of nodes for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

[Figure 25: ML-ACO execution times, Athos cluster, for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE).]


[Figure 26: GR-ACO execution times, Athos cluster, for R15B, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

[Figure 27: R15B execution times, messages ×500, for TL-ACO, ML-ACO, and GR-ACO.]


[Figure 28: OTP 17.4 execution times, messages ×500, for TL-ACO, ML-ACO, and GR-ACO.]

[Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500, for TL-ACO, ML-ACO, GR-ACO, and SR-ACO.]


[Figure 30: R15B execution times (2), Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

which illustrate a phenomenon that has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


[Figure 31: OTP 17.4 execution times (2), Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.]

[Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
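The degree of fragmentation can be made concrete by expanding the SLURM hostlist. The following sketch (illustrative Python, not part of our benchmarking harness; the helper names are ours) expands a bracketed node list and counts the maximal runs of consecutive node numbers, a rough proxy for how scattered an allocation is:

```python
import re

def expand_hostlist(hostlist):
    """Expand a SLURM-style hostlist such as 'atcn[141,144,181-184]'
    into a sorted list of node numbers."""
    body = re.search(r"\[(.*)\]", hostlist).group(1)
    nodes = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(range(int(lo), int(hi) + 1))
        else:
            nodes.append(int(part))
    return sorted(nodes)

def contiguous_runs(nodes):
    """Count maximal runs of consecutive node numbers."""
    runs = 1
    for a, b in zip(nodes, nodes[1:]):
        if b != a + 1:
            runs += 1
    return runs

nodes = expand_hostlist("atcn[141,144,181-184,189-198]")
print(len(nodes), contiguous_runs(nodes))  # 16 nodes in 4 separate runs
```

Applied to the two allocations above, such a count shows the busy-cluster allocation split into far more, and smaller, runs of consecutive nodes than the weekend one.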

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running on Erlang/OTP 17.4 than on Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
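The quoted speedups are simply the single-node runtime divided by the n-node runtime. A minimal sketch of the calculation (Python for illustration; the runtimes are approximate values read off Figure 34, chosen to be consistent with the speedups quoted above):

```python
# Approximate runtimes in minutes for the City instance on the GPG
# cluster (illustrative values consistent with Figures 34 and 35).
runtime = {1: 1000, 2: 667, 4: 455, 16: 290}  # nodes -> minutes

# Speedup relative to a single 16-core node.
speedup = {n: runtime[1] / t for n, t in runtime.items()}

print({n: round(s, 2) for n, s in speedup.items()})
# roughly {1: 1.0, 2: 1.5, 4: 2.2, 16: 3.45}
```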

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that the default settings would select.
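For illustration, both settings can be given on the Erlang command line when starting a node by hand (a hypothetical invocation; the actual Sim-Diasca launch scripts set such flags internally):

```shell
# Bind schedulers with the thread_no_node_processor_spread policy, and
# create and keep online only 12 schedulers (one per physical core)
# rather than the 24 logical cores exposed by hyperthreading.
erl +sbt tnnps +S 12:12
```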

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.
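The absolute memory figure follows directly from the percentage reported by top (a trivial check; the 64GB figure is the per-host RAM stated above):

```python
total_ram_gb = 64   # RAM per GPG host
top_memory = 0.14   # 14% reported by top for the City instance
print(round(top_memory * total_ram_gb, 2))  # 8.96 (GB)
```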

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at this scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and in the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code; and

• most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores of the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so that it would not capture information unrelated to the actual simulation, but rather, for example, to its setup) nor too late (so that it would not miss any information from the simulation execution).

To achieve this we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability

• information about messages sent and received


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB) and that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.
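This barrier behaviour can be caricatured with a toy model (illustrative Python only, not Sim-Diasca code): the wall-clock length of a diasca on a node is governed by its slowest local model instance, so a single slow instance leaves all the others idle.

```python
def diasca_length(actor_times):
    """A diasca ends only once every local model instance has reported
    back to its time manager, so its wall-clock length is the slowest
    instance's evaluation time."""
    return max(actor_times)

# Nine fast instances and one straggler: the fast ones spend almost the
# whole diasca waiting, which shows up as idle time in Percept2.
times = [0.01] * 9 + [0.50]
length = diasca_length(times)
idle_fraction = sum(length - t for t in times) / (len(times) * length)
print(length, round(idle_fraction, 2))  # 0.5 0.88
```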

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates: by default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks, which allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to make small changes to Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes). Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency: Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing: Our load-testing tool, Megaload, uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned, but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them (the base part is the node name without the host name or host address); this eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, this functionality could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of the simulation run will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
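As a sketch of how such a hierarchy might be set up with the SD Erlang API (the module name, function, group names, and return-value shapes below are illustrative assumptions, not taken from Sim-Diasca):

```erlang
%% Illustrative sketch only: partitioning time managers with SD Erlang
%% s_groups. Assumes the SD Erlang s_group module; the group names and the
%% exact return-value shapes are assumptions, not Sim-Diasca identifiers.
-module(tm_groups_sketch).
-export([join_hierarchy/3]).

%% Each non-root time manager node joins the s_group it shares with its
%% parent and siblings, and creates an s_group containing its own children.
join_hierarchy(ParentGroup, MyGroup, ChildNodes) ->
    %% Become a member of the parent's group (parent + siblings + self).
    {ok, ParentGroup, _Nodes1} = s_group:add_nodes(ParentGroup, [node()]),
    %% Create the group that connects this manager to its children.
    {ok, MyGroup, _Nodes2} = s_group:new_s_group(MyGroup, [node() | ChildNodes]),
    ok.
```

A root time manager would only create its own group, since it has no parent group to join.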

ICT-287510 (RELEASE) 23rd December 2015 54

Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution  Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module  This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a basename (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
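A typical session on each compute node, based on the API just described, might then look as follows (a sketch; the exact return values are assumptions):

```erlang
%% Sketch of a session on each compute node, using the helper API described
%% above. Exact return values are assumptions, not taken from the source.
start_and_report() ->
    mpihelper:startup(),                 % initialise MPI and the net_kernel
    World  = mpihelper:get_world_size(), % total number of Erlang nodes
    Index  = mpihelper:get_index(),      % this node's MPI index number
    Others = mpihelper:nodes(),          % all other nodes, as with nodes/0
    io:format("node ~p of ~p, peers: ~p~n", [Index, World, Others]).
```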

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.
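On the Erlang side, such command bytes are typically written to the port with port_command/2. A minimal sketch of what a wrapper for this protocol could look like (the numeric command-byte values are invented placeholders; the real driver defines its own encoding):

```erlang
%% Sketch: an Erlang-side wrapper for the command-mode protocol described
%% above. The numeric command bytes are invented placeholders, not the
%% driver's actual encoding.
-define(CMD_LISTEN,  1).
-define(CMD_ACCEPT,  2).
-define(CMD_CONNECT, 3).
-define(CMD_SEND,    4).

listen(Port)        -> port_command(Port, <<?CMD_LISTEN>>).
accept(Port)        -> port_command(Port, <<?CMD_ACCEPT>>).
connect(Port, Node) -> port_command(Port, [<<?CMD_CONNECT>>, term_to_binary(Node)]).
send(Port, Data)    -> port_command(Port, [<<?CMD_SEND>>, Data]).
```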

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to trigger tick messages when there has been no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot omitted: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot omitted: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot omitted: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot omitted: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Plot omitted: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot omitted: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot omitted: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot omitted: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version | Date | Comments

0.1 | 31/01/2015 | First version. Submitted to internal reviewers.

0.2 | 23/03/2015 | Revised version based on comments from all internal reviewers, submitted to the Commission Services.

1.0 | 27/03/2015 | Final version, submitted to the Commission Services.

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611-620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45-62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73-74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762-774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371-379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207-221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287-296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305-320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181-5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346-354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986-996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 18: Process Placement in Multi-Level ACO


1 + P + P^2 + P^3 + ... + P^(X-2) + P^X <= N. For example, if P = 5 and N = 150, then the tree will have 3 levels (i.e. 1 + 5 + 5^3 <= 150) and only 131 nodes out of 150 can be used.

• Globally Reliable ACO (GR-ACO). This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision, so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO). This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO nodes are only connected to the nodes in their own s_group.

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters, and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes (the master, sub-masters, colonies and ants), regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.

Other metrics might be proposed; for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


[Plot omitted: mean error (%) against number of colonies (1 to 256).]

Figure 19: Mean Error

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, and then to run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation, to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions. It is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, …, 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Mean execution time (s) against the number of colonies (1–256).

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations, and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
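The determinisation trick described above (replacing the random number generator with a function that returns a cyclic sequence of numbers) can be sketched as follows. This is an illustrative Python model only; the actual replacement was made in the Erlang code, and the particular sequence used is not specified in this report:

```python
from itertools import cycle

def make_cyclic_uniform(seq):
    """Return a zero-argument 'random' function that cycles through a
    fixed sequence, making repeated runs deterministic."""
    it = cycle(seq)
    return lambda: next(it)

# Illustrative sequence: any fixed list of values in [0, 1) would do.
rand = make_cyclic_uniform([0.1, 0.5, 0.9])
print([rand() for _ in range(5)])  # [0.1, 0.5, 0.9, 0.1, 0.5]
```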

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
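Erlang's timer:tc returns a pair of the elapsed time in microseconds and the result of the call. For readers unfamiliar with it, a rough Python analogue of this measurement style (illustrative only, not the project's measurement code):

```python
import time

def tc(f, *args):
    """Rough analogue of Erlang's timer:tc: run f(*args) and return
    (elapsed_microseconds, result)."""
    t0 = time.perf_counter()
    result = f(*args)
    elapsed_us = int((time.perf_counter() - t0) * 1_000_000)
    return elapsed_us, result

us, value = tc(sum, range(1_000_000))
print(value)  # 499999500000; `us` holds the wall-clock cost of the call
```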

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope: GR-ACO uses global name registration, which is known to cause performance problems; TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes. As explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO).

Figure 22: OTP 17.4 execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO).


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO).

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.
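The connection-count saving that s_groups provide can be estimated with a back-of-the-envelope calculation. The sketch below assumes a simplified topology (a full mesh within each group, plus one link from each group head to a single master node); the actual SR-ACO topology may differ in detail:

```python
def full_mesh_connections(n):
    """Distributed Erlang default: every node connects to every other."""
    return n * (n - 1) // 2

def grouped_connections(n_groups, group_size):
    """Simplified s_group topology: a full mesh inside each group, plus
    one link per group head to a single master node."""
    per_group = group_size * (group_size - 1) // 2
    return n_groups * per_group + n_groups

print(full_mesh_connections(256))   # 32640 connections for 256 nodes
print(grouped_connections(16, 16))  # 16 * 120 + 16 = 1936
```

Even under these crude assumptions, partitioning 256 nodes into 16 groups of 16 cuts the connection count by more than an order of magnitude, which is consistent with the network-traffic reductions reported in §3.3.4.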

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)).

Figure 25: ML-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)).


Figure 26: GR-ACO execution times, Athos cluster (execution time (s) against number of nodes; R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)).

Figure 27: R15B execution times, messages x 500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO).


Figure 28: OTP 17.4 execution times, messages x 500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO).

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500 (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO).


Figure 30: R15B execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO).

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and this means that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO).

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster (execution time (s) against number of nodes; TL-ACO, ML-ACO, GR-ACO, SR-ACO).


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
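The degree of fragmentation of such an allocation can be quantified by parsing the SLURM range expression inside the brackets and counting its contiguous blocks. An illustrative sketch, assuming SLURM's comma-separated range syntax (the strings below are short illustrative fragments, not the full allocations):

```python
def expand(ranges):
    """Expand a SLURM-style range list, e.g. '055-072,109' -> node numbers."""
    nodes = []
    for part in ranges.split(','):
        lo, _, hi = part.partition('-')
        nodes.extend(range(int(lo), int(hi or lo) + 1))
    return nodes

def fragments(ranges):
    """Number of contiguous blocks in the allocation."""
    return ranges.count(',') + 1

busy = '141,144,181-184,189-198,235-286'   # fragment of a busy-time allocation
quiet = '055-072,109-144,199-216'          # fragment of a quiet-time allocation
print(fragments(busy), fragments(quiet))   # 5 3
print(len(expand('055-072')))              # 18 nodes in one contiguous block
```

More fragments for the same node count means more widely scattered hosts, and hence a higher chance that adding one more machine crosses into a more distant region of the interconnect.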

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.

Figure 33: Network traffic in ML-ACO, GR-ACO and SR-ACO ((a) number of sent packets; (b) number of received packets).

Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage, and Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
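The speedup and efficiency figures quoted here follow directly from the runtimes. A small sketch using approximate values read off Figures 34 and 35 (illustrative readings, not the exact measured data):

```python
def speedup(t_base, t_n):
    """Relative speedup of a run taking t_n versus the base run t_base."""
    return t_base / t_n

def efficiency(t_base, t_n, base_nodes, n_nodes):
    """Parallel efficiency relative to the base configuration."""
    return speedup(t_base, t_n) * base_nodes / n_nodes

# Approximate runtimes (minutes) read off Figure 34; illustrative only.
t1, t16 = 1000.0, 290.0
print(round(speedup(t1, t16), 2))            # 3.45 on 16 nodes
print(round(efficiency(t1, t16, 1, 16), 2))  # 0.22: most cores are idle
```

An efficiency of roughly 0.22 at 16 nodes is what motivates the profiling work in the following sections.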

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread binding policy, proved to be effective. In this mode schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 logical cores available) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets, between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed on EDF's Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.
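A plateau of this kind is what Amdahl's law predicts when a substantial serial fraction remains. As a rough illustration (the speedup of 4 on 8 schedulers used below is an approximate reading of Figure 41, not a reported measurement):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n schedulers if fraction p of the work
    is parallelisable."""
    return 1.0 / ((1.0 - p) + p / n)

def parallel_fraction(s, n):
    """Invert Amdahl's law: parallel fraction implied by speedup s on n cores."""
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n)

p = parallel_fraction(4.0, 8)            # ~0.857: a ~14% serial part remains
print(round(p, 3))                       # 0.857
print(round(amdahl_speedup(p, 64), 2))   # even 64 schedulers cap out near 6.4
```

Under this (crude) model, no amount of extra schedulers would push the speedup much beyond 6, which is consistent with the observed lack of scaling past eight schedulers.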

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node that we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability,

• information about messages sent and received,


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the "tiny" scale of City-simulation with duration "brief"


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the "small" scale of City-simulation with duration "brief"


• information about scheduler concurrency,

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB) that Percept2 could analyse, but that contained information for approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy the computing nodes first and the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involved terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks were implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes: The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all and simply expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible, which is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the nodes, but should start them by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have certain names, and the name of the simulation must be passed to the script that starts a computing node. As an example of the naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
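The naming convention above can be sketched as follows. This is an illustrative Python helper only (the real name calculation is performed by Sim-Diasca and WombatOAM in Erlang, and the exact separators are assumptions based on the example in the text):

```python
def computing_node_name(simulation, user, host):
    # Capitalize each underscore-separated word of the simulation name,
    # e.g. soda_benchmarking_test -> Soda_Benchmarking_Test.
    camel = "_".join(part.capitalize() for part in simulation.split("_"))
    # Prefix with "Sim-Diasca_" and append the user and host, Erlang-node style.
    return "Sim-Diasca_{0}-{1}@{2}".format(camel, user, host)

print(computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1"))
# Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1
```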

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes: When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a configuration file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the time manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The time manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of time manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a time manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each time manager s_group would provide a gateway with processes that route messages to other s_groups.
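The proposed gateway routing can be modelled with a toy tree, sketched here in Python purely for illustration (the real design would be Erlang processes grouped by s_groups; all names are hypothetical). A message between time managers in different s_groups hops up and down through the parent gateways:

```python
class TimeManager:
    """Toy model of the proposed design: each time manager is in its
    parent's s_group and in an s_group with its own children; messages
    to other groups hop through the parent gateways."""
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        if parent:
            parent.children.append(self)

    def subtree(self):
        return [self] + [d for c in self.children for d in c.subtree()]

    def route(self, dest, hops=None):
        # Collect the gateway hops a message takes to reach `dest`.
        hops = hops or [self.name]
        if dest is self:
            return hops
        if dest in self.subtree():
            nxt = next(c for c in self.children if dest in c.subtree())
        else:
            nxt = self.parent  # not below us: route upwards via our gateway
        return nxt.route(dest, hops + [nxt.name])

root = TimeManager("root")
tm_a, tm_b = TimeManager("tm_a", root), TimeManager("tm_b", root)
leaf = TimeManager("tm_a1", tm_a)
print(leaf.route(tm_b))  # ['tm_a1', 'tm_a', 'root', 'tm_b']
```

The point of the sketch is that only parent/child links are ever traversed, so no fully connected mesh between all time manager nodes is required.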


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user access the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
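The spin-loop idea can be illustrated as follows, in Python for brevity (the actual workaround patches the C runtime of Erlang/OTP). A non-blocking descriptor is read in a loop instead of issuing a blocking read() that could deadlock CNK:

```python
import os

def spin_read(fd, nbytes):
    # Retry a non-blocking read until data arrives, instead of issuing a
    # blocking read() -- the shape of the CNK workaround described above.
    while True:
        try:
            return os.read(fd, nbytes)
        except BlockingIOError:
            continue  # nothing available yet: spin rather than block

# Demonstrate on a non-blocking pipe (on CNK, pipes themselves had to be
# replaced by manually connected socket pairs, accessed the same way).
r, w = os.pipe()
os.set_blocking(r, False)
os.write(w, b"ping")
print(spin_read(r, 4))  # b'ping'
```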

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be exploited by the Erlang virtual machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us first see, however, how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution: Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes that we explicitly send messages to, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module: This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.
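The name construction can be sketched as follows, in Python for illustration (the real helper is the Erlang mpi_helper module; the "@" separator and host names here are assumptions for the sake of the example):

```python
def mpi_node_name(basename, mpi_index, hostname):
    # basename ++ MPI index ++ hostname, as the startup helper builds it
    # (the "@" separator is an assumption, matching Erlang node-name syntax).
    return "{0}{1}@{2}".format(basename, mpi_index, hostname)

# One Erlang node per MPI rank, all started with the same base name:
names = [mpi_node_name("mpinode", rank, "cn%03d" % rank) for rank in range(3)]
print(names)  # ['mpinode0@cn000', 'mpinode1@cn001', 'mpinode2@cn002']
```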

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.
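The dispatch in output can be sketched as a table keyed by the leading command byte (in Python for illustration; the byte values and handler behaviour here are hypothetical, as the real driver is written in C):

```python
# Hypothetical command-byte values; the real driver defines its own encoding.
CMD_LISTEN, CMD_ACCEPT, CMD_CONNECT, CMD_SEND, CMD_RECEIVE = range(5)

HANDLERS = {
    CMD_LISTEN:  lambda payload: "listen",
    CMD_ACCEPT:  lambda payload: "accept",
    CMD_CONNECT: lambda payload: "connect to " + payload.decode(),
    CMD_SEND:    lambda payload: "send {} bytes".format(len(payload)),
    CMD_RECEIVE: lambda payload: "receive",
}

def output(buf):
    # Parse the leading command byte and relay the remainder of the
    # buffer to the matching functionality, as the driver's output() does.
    return HANDLERS[buf[0]](buf[1:])

print(output(bytes([CMD_CONNECT]) + b"mpinode1"))  # connect to mpinode1
```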

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as-yet-inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: mean execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's ATHOS cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
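The two ant-count sequences and the averaging over runs can be reconstructed as follows (an illustrative Python sketch of the experimental setup described above):

```python
from statistics import mean

# The two experiment sequences described above.
small = [1] + list(range(10, 1001, 10))      # 1, 10, 20, 30, ..., 1000
large = [1] + list(range(500, 100001, 500))  # 1, 500, 1000, 1500, ..., 100000

def mean_runtime(times):
    # Each ant count was run 5 times; the graphs plot the mean of the runs.
    return mean(times)

print(len(small), len(large))  # 101 201
```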

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: mean execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: mean execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: mean execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: mean execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: mean execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: mean execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: mean execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version 0.1, 31/01/2015: First version, submitted to internal reviewers.

Version 0.2, 23/03/2015: Revised version based on comments from all internal reviewers, submitted to the Commission Services.

Version 1.0, 27/03/2015: Final version, submitted to the Commission Services.

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Gunther Rudolph, Xin Yao, Evelyne Lutton, Juan Julian Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



1 + P + P^2 + P^3 + ... + P^(X-2) + P^X <= N. For example, if P = 5 and N = 150 then the tree will have 3 levels (i.e. 1 + 5 + 5^3 <= 150) and only 131 nodes out of 150 can be used.
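The capacity constraint can be checked with a short script. This is a hypothetical helper, not part of the deliverable's code; it assumes the tree shape described above: a master, intermediate submaster levels of sizes P^0 .. P^(X-2), and P^X colony nodes at the leaves.

```python
def usable_nodes(p, n):
    """Return (levels, used): the deepest tree with branching factor p
    satisfying 1 + p + ... + p^(x-2) + p^x <= n, and its node count."""
    def tree_size(x):
        # master and submaster levels (p^0 .. p^(x-2)) plus p^x colonies
        return sum(p ** i for i in range(x - 1)) + p ** x

    x = 2  # minimal tree: master directly above colonies
    while tree_size(x + 1) <= n:
        x += 1
    return x, tree_size(x)

levels, used = usable_nodes(5, 150)
# levels == 3, used == 131: 1 master + 5 submasters + 125 colonies
```

With P = 5 and N = 150 this reproduces the worked example in the text: a 3-level tree using 131 of the 150 nodes.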

• Globally Reliable ACO (GR-ACO): This adds fault-tolerance using functions from Erlang's global module. In ML-ACO, if a single colony fails to report back, the whole system will wait indefinitely. GR-ACO adds supervision so that faulty colonies can be detected and restarted, allowing the system to continue its execution.

• Scalable Reliable ACO (SR-ACO): This also adds fault-tolerance, but using the RELEASE project's s_groups [CLTG14] instead of the global methods, which should improve scalability. In addition, in SR-ACO, nodes are only connected to the nodes in their own s_group.

The reliability of the GR-ACO and SR-ACO versions was tested using the Chaos Monkey tool [Lun12, Hof10]. Chaos Monkey randomly kills Erlang processes on a node by sending exit signals. We ran Chaos Monkey on all computational nodes: master, sub-masters and colonies. The results showed that the reliable ACO versions could survive all of the failures that Chaos Monkey caused. Victim processes included all kinds of processes, the master, sub-masters, colonies and ants, regardless of whether they were initial processes or recovered ones.

3.2.3 Evaluating Scalability

The question of how to measure scalability requires some discussion. There are two commonly-used notions of scalability for parallel and distributed systems:

• Strong scaling, where one measures the change in overall execution time for an input of given size as the number of processors or nodes increases.

• Weak scaling, where one considers the maximum problem size that can be solved in a given time as the number of processors or nodes varies.

Neither of these concepts is really applicable to distributed ACO systems, since each colony performs the same amount of work irrespective of how many other colonies there are. Instead, one obtains an improvement in the quality of solutions as the number of colonies increases. This has been amply established in the ACO literature: see [KYSO00, MRS02, RV09, Del13, IB13] for example, and in particular [PNC11], which is a survey of 69 papers in this area. We have also confirmed this with our distributed version (see Section 3.3.4).

So what is a sensible way to look at scaling for the ACO application? In our distributed implementations there are two alternating phases: in the first phase, the individual colonies work in isolation to construct solutions; in the second phase (the coordination phase), solutions from the individual colonies are collected and compared, and the best result is broadcast back to the colonies. Since the time taken by the colonies is approximately constant, any change in execution time is due to the coordination phase; this in turn depends on (a) the particular coordination strategy (which differs in the various versions discussed above), and (b) overheads introduced by communication in the distributed Erlang system. Since the latter is one of the main focuses of RELEASE, we take total execution time (with some fixed number of iterations) to be a reasonable scalability metric for the ACO application; indeed, this is arguably a more suitable metric than the usual notions of strong and weak scalability, since there is a clear dependence on communication time.
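The alternating construct/coordinate structure can be modelled schematically as follows. This is a plain-Python illustration, not the project's Erlang code; the `construct` step is an invented stand-in for a colony's entire local-iteration phase, with lower cost meaning a better solution.

```python
import random

def construct(colony_id, best_so_far):
    # Placeholder for a colony's isolated work phase: each colony spends
    # a roughly constant amount of effort improving on the broadcast best.
    return best_so_far + random.choice([0, -1])

def run(num_colonies, global_iterations, initial_cost=100):
    best = initial_cost
    for _ in range(global_iterations):
        # Phase 1: colonies work in isolation.
        solutions = [construct(c, best) for c in range(num_colonies)]
        # Phase 2 (coordination): collect, compare, broadcast the best.
        best = min(solutions + [best])
    return best
```

In the real system the coordination phase is where the distributed-Erlang communication costs appear, which is why total execution time over a fixed number of iterations is a meaningful metric.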

Other metrics might be proposed: for example, one might run the program until it arrives at a known optimal solution. This particular metric would be highly unsuitable for our purposes, however. The random nature of the ACO algorithm means that there could be significant variations in execution time in repeated runs with the same inputs; indeed, the algorithm might get stuck at some locally optimal solution and fail to ever arrive at the global optimum and terminate.


Figure 19: Mean Error. (Plot of mean error (%) against number of colonies, from 1 to 256.)

3.2.4 Experimental Evaluation

A method which is commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, and then to run your program on them for some fixed number of iterations and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]9), gradually increasing their number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
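The quality metric just described reduces to a small function: the mean percentage gap between the costs the program finds and the known optimal costs. A sketch, with invented instance costs rather than ORLIB data:

```python
def mean_error(found_costs, optimal_costs):
    """Mean percentage gap between found and known-optimal costs."""
    gaps = [100.0 * (f - o) / o for f, o in zip(found_costs, optimal_costs)]
    return sum(gaps) / len(gaps)

mean_error([105.0, 98.0], [100.0, 98.0])  # -> 2.5
```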

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we

9 The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html


Figure 20: Execution time. (Plot of mean execution time (s) against number of colonies, from 1 to 256.)

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts, and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.
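For reference, Erlang's timer:tc returns the wall-clock duration of a single call in microseconds. A rough Python analogue (a hypothetical helper, not the project's measurement code):

```python
import time

def tc(fun, *args):
    """Rough analogue of Erlang's timer:tc: returns
    (elapsed_microseconds, result) for one function call."""
    start = time.perf_counter()
    result = fun(*args)
    elapsed_us = int((time.perf_counter() - start) * 1_000_000)
    return elapsed_us, result

elapsed, value = tc(sum, range(1_000_000))
```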

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below). ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO and GR-ACO.)

Figure 22: OTP 17.4 execution times, Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO and GR-ACO.)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version, and as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).)

Figure 25: ML-ACO execution times, Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).)


Figure 26: GR-ACO execution times, Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).)

Figure 27: R15B execution times, messages x 500. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO and GR-ACO.)


Figure 28: OTP 17.4 execution times, messages x 500. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO and GR-ACO.)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.)


Figure 30: R15B execution times (2), Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO and GR-ACO.)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and this means that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO and GR-ACO.)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. (Plot of execution time (s) against number of nodes, 0–250, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
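The degree of fragmentation can be quantified by expanding the hostlist expression and counting contiguous runs of node numbers. This is a rough sketch that handles only the simple comma/range form shown above, not general SLURM hostlist syntax:

```python
import re

def expand(hostlist):
    """Expand e.g. 'atcn[055-057,060]' into node numbers [55, 56, 57, 60]."""
    body = re.search(r"\[(.*)\]", hostlist).group(1)
    nodes = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(range(int(lo), int(hi) + 1))
        else:
            nodes.append(int(part))
    return nodes

def contiguous_blocks(nodes):
    """Count runs of consecutive node numbers: more runs = more fragmented."""
    return 1 + sum(1 for prev, cur in zip(nodes, nodes[1:]) if cur != prev + 1)

nodes = expand("atcn[055-072,109-144,199-216]")
# 3 contiguous blocks, 18 + 36 + 18 = 72 nodes
```

Applied to the two allocations above, both expand to 256 nodes, but the first splits into far more contiguous blocks than the second.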

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
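The relative speedup is simply the single-node runtime divided by the n-node runtime. A quick sketch, using illustrative round numbers close to the reported values rather than the measured data:

```python
def relative_speedup(runtimes_by_nodes):
    """Map node count -> speedup relative to the single-node runtime."""
    base = runtimes_by_nodes[1]
    return {n: base / t for n, t in runtimes_by_nodes.items()}

# Illustrative runtimes in minutes (assumed, not the measured figures).
speedups = relative_speedup({1: 1000.0, 2: 667.0, 4: 455.0, 16: 290.0})
# speedups[16] is about 3.45: far from the ideal factor of 16
```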

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered as a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The number of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way, we end up with one file for each computing node that we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM with as much parallelism as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole simulation, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

Two WombatOAM features implemented since then make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks: these allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that they want to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
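As an illustration, the two entries might be combined as follows; the concrete syntax (a plain option list, the option value shapes, the cookie name) is a hypothetical sketch rather than Sim-Diasca's actual configuration format.

```erlang
%% Hypothetical sketch of the two configuration entries; the concrete
%% syntax used by Sim-Diasca's deployment settings may differ.
DeploymentOptions = [
    %% The computing nodes received as a parameter are assumed to be
    %% already running, so the deployment manager must not deploy them:
    {start_nodes, false},
    %% Use this fixed cookie (the one shared by all computing nodes)
    %% instead of generating a random one on the user node:
    {use_cookies, 'wombat_sim_cookie'}
].
```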

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, the node should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.
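The naming convention above can be sketched as follows; the exact capitalisation and separator rules applied by Sim-Diasca are assumptions made for illustration, reverse-engineered from the single example given.

```erlang
%% Hypothetical reconstruction of the node-naming convention described
%% above; Sim-Diasca's actual rules may differ in detail.
-module(node_naming).
-export([node_name/3]).

%% node_name("soda_benchmarking_test", "myuser", "10.0.0.1") ->
%%     'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'
node_name(Simulation, User, Host) ->
    Camel = camel_case(Simulation),
    list_to_atom("Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host).

%% "soda_benchmarking_test" -> "Soda_Benchmarking_Test"
camel_case(Str) ->
    Words = string:tokens(Str, "_"),
    string:join([capitalize(W) || W <- Words], "_").

capitalize([C | Rest]) -> [string:to_upper(C) | Rest].
```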

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
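As a sketch, such a partitioning of the time-manager tree could be set up with SD Erlang's s_group:new_s_group/2. The group names, the return-value handling, and the tree layout below are illustrative assumptions rather than an implemented design:

```erlang
%% Illustrative sketch of partitioning the time-manager tree into
%% s_groups with SD Erlang; group names and layout are assumptions.
-module(tm_sgroups).
-export([create_hierarchy/2]).

%% RootNode hosts the root time manager; Subtrees is a list of
%% {LocalTmNode, ChildNodes} pairs, one per non-root time manager.
create_hierarchy(RootNode, Subtrees) ->
    %% s_group containing the root and its direct children: each local
    %% time manager thus shares a group with its parent and siblings.
    s_group:new_s_group(root_tm_group,
                        [RootNode | [N || {N, _} <- Subtrees]]),
    %% One s_group per non-root time manager and its own children.
    lists:foreach(
      fun({TmNode, ChildNodes}) ->
              Name = list_to_atom("tm_group_" ++ atom_to_list(TmNode)),
              s_group:new_s_group(Name, [TmNode | ChildNodes])
      end,
      Subtrees).
```

Gateway processes routing messages between the groups (as in the Multi-level ACO design of Section 3.2.2) would then run on the nodes that belong to two groups.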


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds, to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK, multiple compute nodes share the same IP address of their associated I/O node; thus, the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi instead, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved to be desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.
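The name construction performed by startup can be sketched as follows; the exact separator between base name and MPI index is an assumption, and net_kernel itself appends the @hostname part when given a short name:

```erlang
%% Hypothetical sketch of mpihelper:startup/1; the real module also
%% exchanges messages between every pair of nodes to force the
%% connections to be set up.
-module(mpihelper_sketch).
-export([startup/1]).

startup(BaseName) ->
    %% get_index() stands for the driver call returning this node's
    %% unique MPI rank (cf. mpihelper:get_index/0).
    Index = get_index(),
    Base = list_to_atom(BaseName ++ integer_to_list(Index)),
    %% net_kernel appends "@" ++ hostname, yielding the
    %% basename ++ MPI index ++ hostname form described above.
    net_kernel:start([Base, shortnames]).

get_index() -> 0.  %% placeholder for the MPI driver call
```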

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions

ICT-287510 (RELEASE) 23rd December 2015 63

little difficult to see the details but the upper line is the result for R15B03 whereas the others are forR16B03 OTP 170 and OTP 174 the latter three being practically indistinguishable

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions


Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 19: Mean Error (mean error (%) against number of colonies, 1 to 256)

3.2.4 Experimental Evaluation

A method commonly used in the ACO community to evaluate the quality of ACO implementations is to take a set of benchmarks whose optimal solutions are known, run the program on them for some fixed number of iterations, and observe how close the program's solutions are to the optimal ones. This method is used in [dBSD00, MM00], for example. For completeness, we have applied this strategy to our TL-ACO application running on 256 nodes of EDF's Athos cluster.

We ran the TL-ACO application on 25 SMTWTP instances of size 100 from the ORLIB dataset (see [PVW91]), gradually increasing the number of nodes from 1 to 256. All experiments were run with 20 local and 20 global iterations; these numbers were chosen after a small amount of experimentation to find choices which gave good results in a reasonable time. Figure 19 shows the mean difference in cost between our solutions and the optimal solutions: it is clear that increasing the number of nodes increases the quality of solutions, although the trend is not strictly downwards, due to the random nature of the ACO algorithm (repeated runs with the same input may produce different solutions). Figure 20 shows the mean time taken for solution. We see an upward trend, due to increasing amounts of communication and the increasing time required to compare incoming results. This is typical of the scaling graphs which we obtain for the ACO application.
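The "mean error" metric of Figure 19 is straightforward to compute from the solution costs and the known optima; an illustrative helper (not part of TL-ACO itself):

```erlang
-module(mean_error).
-export([mean_error/1]).

%% Mean relative error (in %) of solution costs against known optima,
%% given a list of {SolutionCost, OptimalCost} pairs, as in the
%% evaluation method described above.
mean_error(Pairs) when Pairs =/= [] ->
    Errors = [100 * (Cost - Opt) / Opt || {Cost, Opt} <- Pairs],
    lists:sum(Errors) / length(Errors).
```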

3.3 Performance comparison of different ACO and Erlang versions on the Athos cluster

This section discusses scaling data for TL-ACO, ML-ACO, GR-ACO and SR-ACO on EDF's Athos cluster. We ran each version with 1, 10, 20, ..., 250 compute nodes; for each number of nodes we recorded the execution time for either 7 or 10 runs (7 runs in later experiments, to reduce the total time for the test), and the plots below show the mean times for these runs. In order to improve reproducibility, we removed non-determinacy by replacing the random number generator with a function which returns a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts, and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.

(9) The ORLIB datasets can be downloaded from http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

Figure 20: Execution time (mean execution time (s) against number of colonies, 1 to 256)
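The deterministic replacement for the random number generator can be as simple as a cyclic sequence; an illustrative sketch (not the actual ACO code):

```erlang
-module(cyclic_rand).
-export([new/1, uniform/1]).

%% Deterministic stand-in for a random number generator: values are
%% drawn from a fixed sequence, cycling back to the start when it is
%% exhausted.  State is {Remaining, FullSequence}.
new(Seq) when Seq =/= [] -> {Seq, Seq}.

uniform({[], All}) -> uniform({All, All});
uniform({[H | T], All}) -> {H, {T, All}}.
```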

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function; they omit some overhead for argument processing at the start of execution.
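The timing pattern is the standard one: timer:tc/1 runs a zero-arity function and returns the elapsed wall-clock time in microseconds. A minimal wrapper of the kind the ACO driver might use (module and function names are illustrative):

```erlang
-module(time_run).
-export([timed/1]).

%% Run a zero-arity work function under timer:tc/1, converting the
%% reported microseconds to seconds; returns {Seconds, Result}.
timed(Work) when is_function(Work, 0) ->
    {Micros, Result} = timer:tc(Work),
    {Micros / 1.0e6, Result}.
```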

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one, even if there is no communication between the nodes; as explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)

Figure 22: OTP 17.4 execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster (TL-ACO, ML-ACO, GR-ACO, SR-ACO)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.
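For illustration, the connection-reducing grouping could be expressed with SD Erlang's s_group API roughly as follows (a pseudocode sketch: function and group names are illustrative, and the exact call shapes should be checked against the SD Erlang documentation [CLTG14]):

```
%% One s_group connects the master to the submasters; a further
%% s_group per submaster connects it to its colony nodes.  Nodes in
%% different groups then keep no transitive connections to each other.
setup_groups(Master, SubMasters, ColoniesOf) ->
    s_group:new_s_group(master_group, [Master | SubMasters]),
    [s_group:new_s_group(group_of(SM), [SM | ColoniesOf(SM)])
     || SM <- SubMasters].
```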

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results, which illustrate a phenomenon that has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

Figure 24: TL-ACO execution times, Athos cluster
Figure 25: ML-ACO execution times, Athos cluster
Figure 26: GR-ACO execution times, Athos cluster
Figure 27: R15B execution times, messages ×500
Figure 28: OTP 17.4 execution times, messages ×500
Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500
Figure 30: R15B execution times (2), Athos cluster

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations: When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, so that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication, and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Figure 31: OTP 17.4 execution times (2), Athos cluster
Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster
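Compact SLURM hostlists like those above can be expanded mechanically (SLURM's own scontrol show hostnames command does the same job); a small illustrative Erlang helper, not part of our test harness:

```erlang
-module(hostlist).
-export([expand/1]).

%% Expand a flat SLURM hostlist such as "atcn[141,144,181-184]" into
%% individual host names.  Minimal sketch: no nested brackets, and
%% zero-padding is preserved from the range's lower bound.
expand(Spec) ->
    {match, [Prefix, Body]} =
        re:run(Spec, "^([a-z]+)\\[(.*)\\]$", [{capture, all_but_first, list}]),
    lists:append([expand_item(Prefix, Item)
                  || Item <- string:tokens(Body, ",")]).

expand_item(Prefix, Item) ->
    case string:tokens(Item, "-") of
        [Single] -> [Prefix ++ Single];
        [Lo, Hi] ->
            Width = length(Lo),
            [Prefix ++ pad(integer_to_list(N), Width)
             || N <- lists:seq(list_to_integer(Lo), list_to_integer(Hi))]
    end.

pad(S, W) when length(S) >= W -> S;
pad(S, W) -> pad([$0 | S], W).
```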

Network traffic congestion: Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO: (a) Number of Sent Packets; (b) Number of Received Packets


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e., the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e., initialisation and execution. The initialisation phase is excluded from our measurements, to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e., 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e., 67 hours or nearly three days.

Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
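The quoted speedups are simply ratios against the single-node (16-core) baseline, e.g. 1000 minutes / 290 minutes ≈ 3.45. A trivial helper makes the efficiency loss explicit (illustrative names, not part of the measurement scripts):

```erlang
-module(speedup).
-export([relative/2, efficiency/3]).

%% Speedup of a runtime relative to the single-node baseline time.
relative(BaseTime, Time) -> BaseTime / Time.

%% Parallel efficiency: relative speedup divided by the hardware
%% multiple (e.g. 4 for 4 nodes vs the 1-node baseline).
efficiency(BaseTime, Time, NodeFactor) ->
    relative(BaseTime, Time) / NodeFactor.
```

For instance, a runtime that only halves on a 4-node cluster gives a relative speedup of 2.0 but an efficiency of 0.5.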

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e., relying on the thread_no_node_processor_spread binding policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
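When sweeping this parameter, note that besides the +S emulator flag the number of online schedulers can also be changed at runtime via erlang:system_flag/2; a sketch (assuming the VM was started with enough schedulers configured):

```erlang
-module(sched_tune).
-export([use_schedulers/1]).

%% Restrict the VM to N online schedulers at runtime (e.g. the 12
%% physical cores of an Athos host rather than its 24 hyperthreads);
%% returns the previous setting.  The +sbt tnnps binding policy
%% itself can only be chosen at VM start-up.
use_schedulers(N) when is_integer(N), N >= 1 ->
    erlang:system_flag(schedulers_online, N).
```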

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 of the 32 available logical cores) and 14% (8.96GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.
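top and netstat observe the OS side; the VM's own introspection calls give a complementary per-node view. An illustrative snapshot helper (not the mechanism used for the figures above):

```erlang
-module(vm_usage).
-export([snapshot/0]).

%% Per-VM resource snapshot from the emulator's own counters: total
%% bytes currently allocated by the VM, and schedulers online.
snapshot() ->
    #{memory_bytes => erlang:memory(total),
      schedulers_online => erlang:system_info(schedulers_online)}.
```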

Figure 39 shows the network traffic, i.e., the number of sent and received packets between nodes in the cluster, during the case study. The numbers of sent and received packets are roughly the same, and the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at this scale.

Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster
Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster
Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster
Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster
Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g., Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and in the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code; and

• most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e., either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca is executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), with eight cores each (i.e., a total of 32 cores, each hyperthreaded, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
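The plugin logic can be sketched as follows. This is a minimal sketch, not the deliverable's code: the callback names simulation_started/1 and simulation_stopped/1 are assumptions standing in for Sim-Diasca's actual plugin API, and the Percept2 entry points (percept2:profile/2 with a trace-file name and option list, percept2:stop_profile/0) are our understanding of that tool's interface.

```erlang
-module(percept2_plugin).
-export([simulation_started/1, simulation_stopped/1]).

%% Assumed plugin callback: invoked once the simulation begins.
%% Starts Percept2 profiling on every computing node, producing one
%% trace file per node.
simulation_started(ComputingNodes) ->
    lists:foreach(
      fun(Node) ->
              File = "percept2_" ++ atom_to_list(Node) ++ ".dat",
              rpc:call(Node, percept2, profile, [File, [procs]])
      end,
      ComputingNodes).

%% Assumed plugin callback: invoked when the simulation ends.
%% Stops profiling everywhere, leaving one analysable file per node.
simulation_stopped(ComputingNodes) ->
    lists:foreach(
      fun(Node) -> rpc:call(Node, percept2, stop_profile, []) end,
      ComputingNodes).
```
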

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. The reason we selected this particular setup was that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability,

• information about messages sent and received,


Figure 4.1: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief': (a) execution time, (b) speedup.


Figure 4.2: BenchErl results running the 'small' scale of City-simulation with duration 'brief': (a) execution time, (b) speedup.


• information about scheduler concurrency, and

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 4.3.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 4.3: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
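As an illustration only, such configuration entries might be written as Erlang terms along the following lines; the exact entry names and value shapes are assumptions, since the deliverable describes the two options but not their concrete syntax.

```erlang
%% Hypothetical Sim-Diasca settings for a WombatOAM-driven deployment:
%% the deployment manager assumes the computing nodes are already
%% running, and the user node reuses a pre-agreed cookie instead of
%% generating random ones.
{start_nodes, false}.
{use_cookies, 'wombat_sim_diasca_cookie'}.
```
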

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 4.4, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 4.5 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
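Such a grouping could be set up along the following lines. This is a sketch only, since the design was not implemented: it assumes SD Erlang's s_group:new_s_group/2 (creating a named s_group over a list of nodes) and entirely hypothetical node and group names.

```erlang
%% Sketch: build a two-level hierarchy of Time Manager s_groups.
%% root@host0 runs the root time manager; tm1..tm4 run local ones.
%% Each parent forms an s_group with its children, so a non-root
%% time manager node belongs to its parent's group and, if it has
%% children, to a group of its own -- exactly the two-group
%% membership described above.
setup_time_manager_groups() ->
    Root    = 'root@host0',
    Level1  = ['tm1@host1', 'tm2@host2'],
    Level2a = ['tm3@host3'],   % children of tm1
    Level2b = ['tm4@host4'],   % children of tm2
    {ok, _, _} = s_group:new_s_group(root_tm_group, [Root | Level1]),
    {ok, _, _} = s_group:new_s_group(tm1_group, ['tm1@host1' | Level2a]),
    {ok, _, _} = s_group:new_s_group(tm2_group, ['tm2@host2' | Level2b]),
    ok.
```
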


Figure 4.4: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 4.5: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

By improving our knowledge of these applications and the scalability issues they experience, we made interpretations that prepare for the removal of the next bottlenecks to be encountered, and identified some design patterns and good practices to follow regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK, multiple compute nodes share the same IP address as their associated I/O node; thus, the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, instead, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI_index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
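Putting the helper functions together, bringing up one node of an MPI-distributed Erlang system could look like this. The sketch assumes the return values suggested by the descriptions above (ok from startup/0, an integer rank, a node list); it requires the MPI driver and runtime, so it is illustrative rather than standalone-runnable.

```erlang
%% Sketch of one Erlang node's start-up in the MPI-based system.
%% Assumes the VM was launched with:
%%   erl -no_epmd -connect_all false -proto_dist mpi
start() ->
    ok = mpihelper:startup(),        % names this node mpinode<I>@<host>
                                     % and initializes net_kernel
    I = mpihelper:get_index(),       % unique MPI index of this node
    N = mpihelper:get_world_size(),  % total number of nodes in the job
    Peers = mpihelper:nodes(),       % all other Erlang nodes
    io:format("node ~p of ~p, peers: ~p~n", [I, N, Peers]).
```
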

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP^10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene/Q to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as-yet-inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

^10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 4.6: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000 ants.

• Large: 1, 500, 1000, 1500, ..., 100000 ants.

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); series as in Figure 46]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series as in Figure 46]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot: execution time (s) against number of ants (1, 500, ..., 100000); series as in Figure 46]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.
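The measurement loop can be sketched as follows. The ant_colony module, its run/3 entry point, and the argument order are hypothetical stand-ins for the actual ACO program, but the use of timer:tc and the averaging over 5 runs match the procedure described above.

```erlang
%% Sketch of the measurement loop for the "small" execution series.
%% ant_colony:run/3 is a hypothetical entry point, not the real API.
-module(aco_bench).
-export([small_series/0]).

-define(RUNS, 5).
-define(INPUT_SIZE, 40).
-define(GENERATIONS, 50).

small_series() ->
    AntCounts = [1 | lists:seq(10, 1000, 10)],       % 1, 10, 20, ..., 1000
    [{NumAnts, mean_seconds(NumAnts)} || NumAnts <- AntCounts].

mean_seconds(NumAnts) ->
    %% timer:tc/3 returns {Microseconds, Result}; keep the time only.
    Micros = [element(1, timer:tc(ant_colony, run,
                                  [NumAnts, ?INPUT_SIZE, ?GENERATIONS]))
              || _ <- lists:seq(1, ?RUNS)],
    lists:sum(Micros) / ?RUNS / 1.0e6.               % mean seconds over 5 runs
```

Each {NumAnts, MeanSeconds} pair corresponds to one point on the curves in Figures 46-54.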

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the OTP 17 versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. [Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series as in Figure 46]

Figure 51: Glasgow Xeon machines, large executions. [Plot: execution time (s) against number of ants (1, 500, ..., 100000); series as in Figure 46]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series as in Figure 46]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot: execution time (s) against number of ants (1, 500, ..., 100000); series as in Figure 46]

Change Log

Version Date Comments

0.1 31/01/2015 First version; submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers; submitted to the Commission Services

1.0 27/03/2015 Final version; submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 20: Execution time. [Plot: mean execution time (s) against number of colonies (1–256)]

removed non-determinacy by replacing the random number generator by a function which returned a cyclic sequence of numbers (in fact, this made little difference to execution times). There is still some variation, but this is typically only about 2–3% around the mean, so we have reduced clutter in the plots by omitting it. Every run used a fixed input of size 40, and we used 24 ants, 40 local iterations and 40 global iterations. The experiments were controlled by a shell script which used ssh to launch Erlang VMs on all of the allocated hosts and then supplied the names of the VMs as input to the main ACO program. VMs were stopped and then restarted between executions.
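A deterministic cyclic generator of the kind just described can be sketched as below. The module name, the particular cycle of numbers, and the use of the process dictionary are illustrative assumptions, not the actual code used in the experiments.

```erlang
%% Deterministic stand-in for the random number generator: each call
%% returns the next element of a fixed cycle, so repeated runs perform
%% identical work. The cycle contents are illustrative only.
-module(cyclic_rand).
-export([uniform/1]).

-define(CYCLE, [12, 7, 3, 9, 14, 1, 8, 5]).

%% Drop-in replacement for random:uniform/1: returns a value in 1..N.
uniform(N) ->
    [X | Rest] = case get(cyclic_rand_state) of
                     Tail when is_list(Tail), Tail =/= [] -> Tail;
                     _ -> ?CYCLE                 % first call, or cycle exhausted
                 end,
    put(cyclic_rand_state, Rest),
    (X rem N) + 1.
```

Because every VM sees the same sequence, run-to-run variation then reflects only the platform, not the workload.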

We ran each experiment with Erlang versions R15B, OTP 17.4, and the RELEASE version of OTP 17.4. Since SD Erlang is only available in the RELEASE version of OTP 17.4, this is the only Erlang version for which we could measure the performance of SR-ACO. There was not enough time to run every combination of ACO and Erlang versions in a single SLURM job, so we ran three separate jobs, with one Erlang version (and all relevant ACO versions) per job.

The execution times here were measured by the ACO program itself, using Erlang's timer:tc function, and they omit some overhead for argument processing at the start of execution.

3.3.1 Basic results

Figures 21–23 show the results for the various Erlang versions. We see that in each case ML-ACO performs slightly better than TL-ACO, and the performance of GR-ACO is significantly worse than both of these. In Figure 23 we see that the performance of SR-ACO is considerably better than all the other versions.

These results are as we would hope. GR-ACO uses global name registration, which is known to cause performance problems. TL-ACO uses a single master node which collects messages from all of the worker nodes, and this can cause a bottleneck (see below); ML-ACO eliminates this bottleneck by introducing a hierarchy of submasters to collect results. Both ML-ACO and TL-ACO use Erlang's default distribution mechanism, where every node is connected to every other one even if there is no communication between the nodes. As explained earlier, in SR-ACO we use SD Erlang's s_groups to


Figure 21: R15B execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 22: OTP 17.4 execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO]


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO]

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending on the Erlang version. As with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.
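The connection-reducing structure of SR-ACO can be illustrated with the SD Erlang s_group API: the master shares a small s_group with each colony node, so colony nodes in different groups never connect to one another, avoiding distributed Erlang's full mesh. The function and group names below are hypothetical, the assumed {ok, Name, Nodes} return shape of s_group:new_s_group/2 should be checked against the SD Erlang release, and error handling is omitted.

```erlang
%% Sketch: partition master and colony nodes into pairwise s_groups.
%% Group names are made up; real SR-ACO may group nodes differently.
setup_colony_groups(Master, ColonyNodes) ->
    lists:foreach(
      fun({I, Node}) ->
              GName = list_to_atom("colony_group_" ++ integer_to_list(I)),
              %% Nodes in an s_group connect only within that group.
              {ok, GName, _Members} =
                  s_group:new_s_group(GName, [Master, Node])
      end,
      lists:zip(lists:seq(1, length(ColonyNodes)), ColonyNodes)).
```

With N colonies this yields O(N) connections from the master instead of the O(N^2) full mesh of default distributed Erlang.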

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)]

Figure 25: ML-ACO execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)]


Figure 26: GR-ACO execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE)]

Figure 27: R15B execution times, messages × 500. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO]


Figure 28: OTP 17.4 execution times, messages × 500. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages × 500. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO]


Figure 30: R15B execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

which illustrate a phenomenon that has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO]


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO. (a) Number of Sent Packets; (b) Number of Received Packets.


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new-small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
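The relative speedup and per-node efficiency quoted above are computed in the usual way, as in the sketch below; the runtimes in the example call are placeholders, not the measured values.

```erlang
%% Relative speedup (T1/Tn) and per-node efficiency (T1/(Tn*Nodes)),
%% computed from {Nodes, Runtime} pairs measured against the
%% single-node baseline runtime.
speedups(BaseTime, Runtimes) ->
    [{Nodes, BaseTime / T, BaseTime / (T * Nodes)}
     || {Nodes, T} <- Runtimes].

%% speedups(1000, [{2, 667}, {4, 455}, {16, 290}]) gives relative
%% speedups of roughly 1.5, 2.2 and 3.45 respectively, with the
%% efficiency falling from about 0.75 to about 0.22 per node.
```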

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.
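The scheduler binding and the number of online schedulers can also be inspected and adjusted from within the VM, as in the sketch below; the settings themselves are normally given on the erl command line (e.g. erl +sbt tnnps +S 24:12), and the function name here is our own.

```erlang
%% Report the scheduler configuration, then restrict the number of
%% online schedulers to the physical core count (12 on the Athos hosts),
%% mirroring the tuning explored in Figure 36.
tune_schedulers(PhysicalCores) ->
    io:format("configured: ~p  online: ~p  bind type: ~p~n",
              [erlang:system_info(schedulers),
               erlang:system_info(schedulers_online),
               erlang:system_info(scheduler_bind_type)]),
    %% system_flag/2 returns the previous value of the setting.
    OldOnline = erlang:system_flag(schedulers_online, PhysicalCores),
    {previously_online, OldOnline}.
```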

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: that is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 of the 32 logical cores available) and 14% (8.96GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
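Such a plugin could be sketched as follows; the callback names, the remote-call approach, and the exact Percept2 entry points shown here are illustrative assumptions, not the actual Sim-Diasca plugin interface.

```erlang
%% Illustrative sketch only: callback names and the exact Percept2 API
%% calls are assumptions, not the real Sim-Diasca plugin interface.
-module(percept2_profiling_plugin).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Start Percept2 on every computing node, one trace file per node.
on_simulation_start(ComputingNodes) ->
    lists:foreach(
      fun(Node) ->
              FileName = atom_to_list(Node) ++ ".percept2",
              %% Assumed entry point for starting the profiler remotely:
              rpc:call(Node, percept2, profile, [FileName])
      end,
      ComputingNodes).

%% Stop Percept2 on all nodes, closing the per-node trace files.
on_simulation_stop(ComputingNodes) ->
    lists:foreach(
      fun(Node) -> rpc:call(Node, percept2, stop_profile, []) end,
      ComputingNodes).
```

The per-node file naming mirrors the observation above that profiling yields one analysable file per computing node.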

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability,

• information about messages sent and received,


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency,

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way, we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licence, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes, and then proceeds with executing the simulation itself. In this solution, Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involved terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution, Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used, e.g., to provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which would make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
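For illustration, such configuration entries might look as follows; the concrete syntax and the cookie value are assumptions, as the actual format is defined by Sim-Diasca's deployment settings.

```erlang
%% Hypothetical illustration: option names follow the prose above, but the
%% concrete configuration syntax is an assumption.
WombatFriendlySettings = [
    %% The computing nodes passed in are already running; skip deployment:
    {start_nodes, false},
    %% Use this fixed cookie (shared by all computing nodes) rather than
    %% generating a random one on the user node:
    {use_cookies, 'sim_diasca_shared_cookie'}
].
```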

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it or let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node, but should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide the user with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology and the associated message routing between s_groups is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
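The double s_group membership of a mid-level time manager could be set up along the following lines with SD Erlang's s_group module; the node and group names are placeholders, and this is a sketch of the design, not an implemented configuration.

```erlang
%% Sketch with placeholder node names: a mid-level time manager node
%% (tm_a@host1) belongs to the s_group of its parent and siblings, and to
%% the s_group containing its children.
s_group:new_s_group(tm_parent_group,
                    ['root_tm@host0', 'tm_a@host1', 'tm_b@host2']),
s_group:new_s_group(tm_a_children,
                    ['tm_a@host1', 'tm_a1@host3', 'tm_a2@host4']).
```

Nodes outside a given s_group would not share connections or a namespace with it, which is what limits the transitive connectivity of the time manager tree.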


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we drew interpretations that prepare for the removal of the next bottlenecks to be encountered, and promote some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK, multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, instead, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved to be desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a basename (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
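Taken together, bringing up a node with the MPI back-end could look like the following sketch, based on the helper functions just described (the io:format call is only for illustration):

```erlang
%% Sketch: initialize MPI-based distribution and query the MPI environment
%% using the mpihelper functions described above.
mpihelper:startup(),                      % builds basename ++ Index ++ Host, calls net_kernel
OtherNodes = mpihelper:nodes(),           % every other Erlang node in the MPI world
WorldSize  = mpihelper:get_world_size(),  % total number of nodes
MyIndex    = mpihelper:get_index(),       % this node's unique MPI index
io:format("node ~p of ~p, peers: ~p~n", [MyIndex, WorldSize, OtherNodes]).
```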

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP^10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.

ICT-287510 (RELEASE) 23rd December 2015 59

Figure 46: EDF Xeon machines, small executions. (Plot: execution time (s) against number of ants, 1–1000; series R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).)

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
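For reference, the two sequences of ant counts are easily generated; a minimal sketch (the module and function names are ours, not from the benchmark code):

```erlang
%% Ant counts used in the two experiment sizes:
%% small: 1, 10, 20, 30, ..., 1000; large: 1, 500, 1000, ..., 100000.
-module(ant_counts).
-export([small_counts/0, large_counts/0]).

small_counts() -> [1 | lists:seq(10, 1000, 10)].
large_counts() -> [1 | lists:seq(500, 100000, 500)].
```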

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. (Plot: execution time (s) against number of ants, 1–100000; series R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).)

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. (Plot: execution time (s) against number of ants, 1–1000; same series as Figure 46.)


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. (Plot: execution time (s) against number of ants, 1–100000; same series as Figure 47.)

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.
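For reference, the flag is passed on the emulator command line when the benchmark node is started; a minimal sketch (the entry point aco:main/0 is a placeholder, not the benchmark's real interface):

```shell
# Illustrative: run the benchmark with the pre-OTP-17 allocator
# defaults by setting the abandon-carrier utilisation limit to 0
# for all allocators (erts_alloc "+M<S>acul" flag).
erl +Muacul 0 -noshell -s aco main -s init stop
```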

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. (Plot: execution time (s) against number of ants, 1–1000; same series as Figure 46.)

Figure 51: Glasgow Xeon machines, large executions. (Plot: execution time (s) against number of ants, 1–100000; same series as Figure 47.)


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.
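One way to restrict the VM to even-numbered CPUs is with taskset; this is an illustrative sketch under our own assumptions, as the deliverable does not record the exact mechanism used (and aco:main/0 is again a placeholder entry point):

```shell
# Illustrative: pin the Erlang VM to even-numbered logical CPUs
# 0,2,...,22 (one logical CPU per physical core on a 24-unit
# hyperthreaded Xeon host), running 12 schedulers.
taskset -c 0-22:2 erl +S 12 -noshell -s aco main -s init stop
```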

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions. (Plot: execution time (s) against number of ants, 1–1000; same series as Figure 46.)


Figure 54: Heriot-Watt AMD machine, large executions. (Plot: execution time (s) against number of ants, 1–100000; same series as Figure 47.)

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(3):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 21: R15B execution times, Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO.)

Figure 22: OTP 17.4 execution times, Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO.)


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO, SR-ACO.)

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact.

For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version; as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow: see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant, while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SD-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,


Figure 24: TL-ACO execution times, Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).)

Figure 25: ML-ACO execution times, Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).)


Figure 26: GR-ACO execution times, Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).)

Figure 27: R15B execution times, messages ×500. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO.)


Figure 28: OTP 17.4 execution times, messages ×500. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO.)

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO, SR-ACO.)


Figure 30: R15B execution times (2), Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO.)

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO.)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. (Plot: execution time (s) against number of nodes, up to 250; series TL-ACO, ML-ACO, GR-ACO, SR-ACO.)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.
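Packet counts of this kind can be read from standard Linux interface counters; a minimal sketch (the interface name eth0 is an assumption, and the exact counters used in our measurements are not shown here):

```shell
# Illustrative: snapshot per-interface packet counters; sampling before
# and after a run and subtracting gives the packets sent and received
# during the experiment. Fields 3 and 11 of /proc/net/dev are the
# received and transmitted packet counts respectively.
awk '/eth0:/ {print "received:", $3, "sent:", $11}' /proc/net/dev
```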

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO. (a) Number of sent packets; (b) number of received packets.


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
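Concretely, these two settings can be combined on the emulator command line; a minimal sketch (the deliverable does not give the exact invocation used):

```shell
# Illustrative: bind schedulers with the thread_no_node_processor_spread
# policy and cap both total and online schedulers at the 12 physical
# cores, ignoring the 12 extra hyperthreads ("+S Schedulers:Online").
erl +sbt tnnps +S 12:12
```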

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (i.e. 22 of the 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets, between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster

ICT-287510 (RELEASE) 23rd December 2015 43

Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order

ICT-287510 (RELEASE) 23rd December 2015 45

to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself, or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.
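A scheduler sweep of this shape can be scripted as follows; this is only an illustration, with bench:run/0 standing in for the actual BenchErl entry point:

```shell
# Illustrative: run the benchmark at increasing online-scheduler counts
# with the thread_no_node_processor_spread binding policy.
for s in 1 2 4 8 16 32 64; do
    erl +sbt tnnps +S "$s" -noshell -s bench run -s init stop
done
```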

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
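The plugin logic can be sketched as follows. This is an illustrative outline only: the callback names (on_simulation_start/1, on_simulation_stop/1) are hypothetical stand-ins for the actual Sim-Diasca plugin API, and the argument list of percept2:profile/2 follows the style of the classic percept API rather than being a confirmed signature.

```erlang
%% Illustrative sketch of the profiling plugin described above.
%% Assumptions: hypothetical callback names; percept2:profile/2
%% arguments in the style of the classic percept API.
-module(percept2_profiling_plugin).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Simulation start: begin tracing on every computing node,
%% producing one trace file per node for later offline analysis.
on_simulation_start(ComputingNodes) ->
    lists:foreach(
      fun(Node) ->
          File = atom_to_list(Node) ++ "_percept2.dat",
          rpc:call(Node, percept2, profile, [File, [all]])
      end,
      ComputingNodes).

%% Simulation end: stop tracing everywhere.
on_simulation_stop(ComputingNodes) ->
    lists:foreach(
      fun(Node) -> rpc:call(Node, percept2, stop_profile, []) end,
      ComputingNodes).
```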

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability

• information about messages sent and received


(a) Execution time

(b) Speedup

Figure 4.1: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 4.2: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 4.3.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.
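This waiting pattern can be pictured as a small receive loop. The following is a toy sketch under assumed message shapes, not Sim-Diasca's actual class_TimeManager implementation:

```erlang
%% Toy sketch of a local time manager blocking at the end of a diasca:
%% it stays idle until every locally scheduled actor has reported that
%% its behaviour evaluation for this diasca is over. The {actor_done, Pid}
%% message shape is a hypothetical illustration.
wait_for_local_actors([]) ->
    diasca_finished;
wait_for_local_actors(PendingActors) ->
    receive
        {actor_done, ActorPid} ->
            wait_for_local_actors(lists:delete(ActorPid, PendingActors))
    end.
```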

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 4.3: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under the LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involved terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
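For illustration, the two entries might look as follows in a deployment-settings term. The node names, the cookie, and the exact term shapes are hypothetical; the concrete Sim-Diasca configuration syntax may differ:

```erlang
%% Hypothetical sketch only; the real Sim-Diasca configuration syntax
%% (and value formats) may differ from this proplist-style rendering.
{start_nodes, ['computing_node_1@10.0.0.1',   % nodes assumed already
               'computing_node_2@10.0.0.2']}. % running: no deployment
{use_cookies, 'shared_simulation_cookie'}.    % cookie already used by
                                              % all computing nodes
```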

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 4.4, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 4.5 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
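Assuming SD Erlang's s_group:new_s_group/2 call for creating a named s_group over a list of nodes, and entirely hypothetical node names, the two s_groups around one non-root time manager could be set up along these lines (a sketch of the proposed design, which has not been implemented):

```erlang
%% Sketch of the proposed (not implemented) s_group partitioning of the
%% time manager tree. Node names are hypothetical.
create_time_manager_groups() ->
    RootNode     = 'root_tm@host0',
    ChildNodes   = ['tm1@host1', 'tm2@host2'],
    ChildrenOfT1 = ['tm3@host3', 'tm4@host4'],
    %% s_group shared by a parent time manager and its child managers:
    {ok, _, _} = s_group:new_s_group(root_tm_group,
                                     [RootNode | ChildNodes]),
    %% each non-root time manager also forms an s_group with its own
    %% children, keeping connections and namespaces local:
    {ok, _, _} = s_group:new_s_group(tm1_group,
                                     ['tm1@host1' | ChildrenOfT1]),
    ok.
```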


Figure 4.4: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 4.5: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving the knowledge about these applications and the scalability issues they experience, interpretations were made that prepare the removal of the next bottlenecks to be encountered, and that promote some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, instead, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, connect only to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which is not available at the time the job is submitted. It also proved to be desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI_index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally,

mpihelper:get_world_size() returns the number of nodes in total, and

mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
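A hypothetical use of these helpers from a job's entry point might look as follows; the exact return values of the helpers are assumptions made for illustration:

```erlang
%% Illustrative only: combining the mpihelper functions described above
%% when bringing up a distributed Erlang job on the compute nodes.
%% Assumed return values: ok from startup/0, integers from
%% get_world_size/0 and get_index/0, a node list from nodes/0.
start() ->
    ok     = mpihelper:startup(),         % default base name "mpinode"
    Total  = mpihelper:get_world_size(),  % number of Erlang nodes overall
    MyRank = mpihelper:get_index(),       % this node's MPI index
    Peers  = mpihelper:nodes(),           % all other, now-connected nodes
    io:format("node ~p of ~p, peers: ~p~n", [MyRank, Total, Peers]).
```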

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014), and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰ For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time in seconds (0.0 to 0.8) against number of ants (1, 10, 20, 30, ..., 1000), with curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 4.6: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:

Figure 47: EDF Xeon machines, large executions. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.
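The measurement protocol above can be sketched as a small driver. This is a sketch only: the benchmark command itself is a placeholder, since how the single-machine ACO program is launched depends on the local installation.

```python
import statistics
import subprocess
import time

def time_once(cmd):
    """Wall-clock one run of the benchmark command."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def mean_runtime(cmd, repetitions=5):
    """Mean execution time over 5 runs, as plotted in the graphs."""
    return statistics.mean(time_once(cmd) for _ in range(repetitions))

# Ant counts for the two experiment sizes (one Erlang process per ant).
small = [1] + list(range(10, 1001, 10))      # 1, 10, 20, 30, ..., 1000
large = [1] + list(range(500, 100001, 500))  # 1, 500, 1000, ..., 100000
```

Each plotted point is then the mean runtime for one ant count under one Erlang/OTP release.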

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.
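For reference, the flag is passed straight to the emulator on the command line; a sketch of an invocation (the benchmark module name here is a placeholder):

```shell
# +Muacul0 sets the "abandon carrier utilisation limit" of the
# alloc_util allocators to 0, disabling the carrier-abandoning
# behaviour that became the default in OTP 17.
erl +Muacul0 -noshell -run aco_bench main -run init stop

# Equivalently, via the environment:
ERL_FLAGS="+Muacul0" erl -noshell -run aco_bench main -run init stop
```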

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a

Figure 50: Glasgow Xeon machines, small executions. [Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.
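On Linux, such a restriction can be imposed from outside the VM; a sketch using taskset (the benchmark module name is a placeholder, and CPU numbering is topology-dependent, so the assumption that even/odd logical CPUs are sibling hardware threads should be checked per machine, e.g. with lstopo):

```shell
# Pin the Erlang VM to the even-numbered logical CPUs only,
# i.e. one hardware thread per physical core on these Xeons.
CPUS=$(seq -s, 0 2 22)   # 0,2,4,...,22 on a 24-unit host
taskset -c "$CPUS" erl -noshell -run aco_bench main -run init stop
```

A similar binding can also be requested from inside the VM with Erlang's own +sct and +sbt scheduler flags.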

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions. [Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 23: OTP 17.4 (RELEASE version) execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO.]

reduce the number of connections, and we attribute SR-ACO's superior performance to this fact. For completeness, Figures 24–26 show how the performance of each ACO version varies depending

on the Erlang version, and as with the results for the Orbit benchmark (see §3.1.4), we see that the OTP 17.4 versions perform more slowly than R15B. We discuss this further in Appendix B, where single-machine experiments suggest that the cause is that the OTP 17.4 VM runs more slowly on Intel Xeon machines than does the R15B VM.

3.3.2 Increasing the number of messages

To examine the effect of the number of messages on performance, we re-ran the experiments from §3.3.1, but configured the ACO application to transmit every message 500 times instead of just once. This gives a rough idea of how the application would perform on a significantly larger cluster. Time constraints prevented us from performing more exhaustive experiments on the Athos cluster, but we did investigate how SD Erlang affected network traffic on the GPG cluster in Glasgow; see §3.3.4 below.

The results for the various ACO versions are shown in Figures 27–29. We see that the performance of TL-ACO is very badly degraded, which we would expect because of its single-master bottleneck. Apart from this, the general situation is as in the previous section: GR-ACO performs badly, ML-ACO performs quite well, and SR-ACO performs very well. Recall also that SR-ACO is fault-tolerant, while ML-ACO is not: failure of a single node in the latter will block the entire application, but the SR version can (often) recover from such events. Thus both execution time and resilience are improved in SR-ACO.

We also have graphs comparing performance across Erlang versions; we have omitted these to save space, but they demonstrate similar phenomena to those seen in the previous section.

3.3.3 Some problematic results

The results shown in Figures 21–23 in §3.3.1 were in fact our second attempt to collect comprehensive results for the ACO application on Athos. An earlier attempt produced much less satisfactory results,

Figure 24: TL-ACO execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 25: ML-ACO execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 26: GR-ACO execution times, Athos cluster. [Plot: execution time (s) against number of nodes; series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 27: R15B execution times, messages ×500. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO.]

Figure 28: OTP 17.4 execution times, messages ×500. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO.]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages ×500. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO.]

Figure 30: R15B execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO.]

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in §3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of §3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,

Figure 31: OTP 17.4 execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO.]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes; series: TL-ACO, ML-ACO, GR-ACO, SR-ACO.]


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
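The fragmentation of an allocation can be quantified by expanding the hostlist expression; a minimal sketch (it handles only the comma-and-range form shown above, not SLURM's full hostlist syntax):

```python
import re

def expand_hostlist(expr):
    """Expand a hostlist such as 'atcn[141,144,181-184]' into
    individual node names, preserving zero-padding."""
    prefix, body = re.fullmatch(r"(\w+)\[(.*)\]", expr).groups()
    nodes = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        for n in range(int(lo), int(hi or lo) + 1):
            nodes.append(f"{prefix}{n:0{len(lo)}d}")
    return nodes

alloc = "atcn[141,144,181-184,189-198]"
print(len(expand_hostlist(alloc)))  # 16 nodes ...
print(alloc.count(",") + 1)         # ... spread over 4 fragments
```

Counting the fragments (comma-separated ranges) in this way makes the contrast between the two allocations above immediate.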

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.
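Per-interface packet counts of this kind can be sampled with netstat or read directly from /proc/net/dev on each node; a sketch of the parsing (the interface name eth0 and the sample counter values are illustrative):

```python
def packet_counts(dev_text, iface):
    """Parse /proc/net/dev content and return (rx_packets, tx_packets)
    for the given interface."""
    for line in dev_text.splitlines():
        name, _, rest = line.strip().partition(":")
        if name.strip() == iface:
            fields = rest.split()
            # Receive block: bytes, packets, errs, drop, ... (8 fields);
            # transmit block starts at index 8 (bytes), 9 (packets).
            return int(fields[1]), int(fields[9])
    raise ValueError(f"interface {iface!r} not found")

sample = """Inter-|   Receive                   |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1234567    8910    0    0    0     0          0         0  765432    5678    0    0    0     0       0          0"""
print(packet_counts(sample, "eth0"))  # (8910, 5678)
```

Differencing two such samples taken before and after a run gives the per-run packet totals plotted in Figure 33.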

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO. (a) Number of Sent Packets; (b) Number of Received Packets.

Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster.

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory, and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption, we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12, and 16 nodes, and hence with 16, 32, 64, 128, 192, and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a

Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster.

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
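Converting the quoted relative speedups into parallel efficiency (speedup divided by node count, with one 16-core node as the baseline) makes the poor utilisation explicit:

```python
def efficiency(speedup, nodes):
    """Parallel efficiency relative to a single 16-core node."""
    return speedup / nodes

# Speedups quoted in the text, relative to one node.
quoted = {2: 1.5, 4: 2.2, 16: 3.45}
for nodes, s in quoted.items():
    # Prints: 2 0.75, then 4 0.55, then 16 0.22
    print(nodes, round(efficiency(s, nodes), 2))
```

That is, each node is roughly 75% utilised at 2 nodes but only about 22% utilised at 16 nodes.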

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread binding policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
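A sketch of the resulting emulator settings for one of these 24-unit (12-core) Athos hosts; `+S 12:12` requests 12 schedulers with 12 online, overriding the default of one scheduler per logical processor:

```shell
# thread_no_node_processor_spread binding, one scheduler per
# physical core, ignoring the 12 hyperthreaded logical cores.
erl +sbt tnnps +S 12:12
```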

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command is used to investigate core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The number of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster.

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster.


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster.

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster.


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster.

the network traffic increases as cluster size grows. To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
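The shape of such a plugin can be sketched as follows. This is an illustrative reconstruction, not the actual plugin: the callback names are hypothetical, and the remote calls assume Percept2's profiling entry points (percept2:profile and percept2:stop_profile; the exact option list is given in the Percept2 documentation).

```erlang
%% Illustrative sketch of a Sim-Diasca plugin driving Percept2 on all
%% computing nodes. The callback names (on_simulation_start/1 and
%% on_simulation_stop/1) are hypothetical.
-module(percept2_plugin_sketch).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Called when the simulation starts: begin profiling on every node,
%% writing one trace file per computing node.
on_simulation_start(ComputingNodes) ->
    lists:foreach(
      fun(Node) ->
              File = "percept2_" ++ atom_to_list(Node) ++ ".dat",
              %% Option list illustrative; see the Percept2 documentation.
              rpc:call(Node, percept2, profile, [File, [all]])
      end,
      ComputingNodes).

%% Called when the simulation ends: stop profiling everywhere.
on_simulation_stop(ComputingNodes) ->
    lists:foreach(
      fun(Node) -> rpc:call(Node, percept2, stop_profile, []) end,
      ComputingNodes).
```

This mirrors the behaviour described above: one trace file per computing node, collected only for the duration of the simulation proper.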

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. The reason we selected this particular setup was that we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;

ICT-287510 (RELEASE) 23rd December 2015 46

Figure 41 BenchErl results running the 'tiny' scale of City-simulation with duration 'brief': (a) execution time; (b) speedup.

Figure 42 BenchErl results running the 'small' scale of City-simulation with duration 'brief': (a) execution time; (b) speedup.

• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances are runnable but are not actually run by the Erlang VM with as much parallelism as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43 Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API, while WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations after each other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
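The intent of the two entries can be illustrated with a hypothetical configuration fragment. The entry shapes and the cookie value below are our own illustration; the actual Sim-Diasca configuration format may differ.

```erlang
%% Hypothetical configuration fragment (illustrative only).
{start_nodes, false}.              % the computing nodes passed in are assumed
                                   % to be already running; do not deploy them
{use_cookies, sim_diasca_cookie}.  % use this fixed cookie (the one all
                                   % computing nodes were started with)
                                   % instead of generating random cookies
```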

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
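The naming convention in the example above can be sketched as a small helper. This is our own reconstruction for illustration; it is not a function of the actual Sim-Diasca or WombatOAM code base.

```erlang
%% Sketch of the computing-node naming convention described above.
%% Illustrative only; the actual implementation may differ in details.
-module(node_naming_sketch).
-export([computing_node_name/3]).

%% computing_node_name(soda_benchmarking_test, "myuser", "10.0.0.1")
%% yields "Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1".
computing_node_name(SimulationName, User, Host) ->
    Camel = camel_case(atom_to_list(SimulationName)),
    "Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host.

%% "soda_benchmarking_test" -> "Soda_Benchmarking_Test"
camel_case(Name) ->
    Words = string:tokens(Name, "_"),
    string:join([capitalize(W) || W <- Words], "_").

capitalize([First | Rest]) -> [string:to_upper(First) | Rest].
```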

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the compute nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described, and shown to be effective for providing scalable reliability, in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
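The grouping described above might be set up along the following lines using SD Erlang's s_group:new_s_group call for creating a group from a list of nodes. This is a design sketch only (as stated, the approach has not been implemented): the group names and return-value handling are illustrative.

```erlang
%% Design sketch: partitioning the time-manager tree into s_groups.
%% Children is a list of {ChildNode, GrandChildNodes} pairs; each time
%% manager node ends up in at most two s_groups, one shared with its
%% parent and siblings and one containing its own children.
-module(tm_sgroup_sketch).
-export([setup_groups/2]).

setup_groups(RootNode, Children) ->
    ChildNodes = [Child || {Child, _} <- Children],
    %% s_group shared by the root time manager and its direct children.
    _ = s_group:new_s_group(tm_root_group, [RootNode | ChildNodes]),
    %% One s_group per child time manager and its own children.
    lists:foreach(
      fun({Child, GrandChildren}) ->
              Name = list_to_atom("tm_group_" ++ atom_to_list(Child)),
              _ = s_group:new_s_group(Name, [Child | GrandChildren])
      end,
      Children).
```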


Figure 44 The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45 SD Erlang version of Sim-Diasca with a hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving the knowledge about these applications and the scalability issues they experience, interpretations were made, preparing the removal of the next bottlenecks to be encountered and promoting some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (these are the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
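The name construction performed by startup can be sketched as follows. This is our reconstruction for illustration; the actual mpihelper module may differ in details such as the separator between the parts.

```erlang
%% Sketch of the node name built by mpihelper:startup/1: the base name,
%% the node's MPI index and the host name are concatenated into the
%% node name passed to net_kernel. Illustrative only.
-module(mpihelper_sketch).
-export([node_name/3]).

%% e.g. node_name("mpinode", 3, "cn017") gives 'mpinode3@cn017'
node_name(BaseName, MpiIndex, HostName) ->
    list_to_atom(BaseName ++ integer_to_list(MpiIndex) ++ "@" ++ HostName).
```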

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46 EDF Xeon machines, small executions. The plot shows execution time (s) against the number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47 EDF Xeon machines, large executions. The plot shows execution time (s) against the number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set.


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set.

The releases used were R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the OTP 17 versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.
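The "roughly constant ratio" observation can be checked mechanically. A minimal sketch (the helper names are illustrative, not tooling from the deliverable) computes the per-point slowdown over paired measurements and its spread:

```python
import statistics

def slowdown_percent(t_new, t_old):
    """Percentage by which t_new exceeds t_old."""
    return 100.0 * (t_new - t_old) / t_old

def ratio_profile(new_times, old_times):
    """Per-point slowdowns for paired measurements; a near-constant
    profile (small standard deviation) supports the 'constant ratio' claim."""
    slow = [slowdown_percent(n, o) for n, o in zip(new_times, old_times)]
    return statistics.mean(slow), statistics.pstdev(slow)
```

For example, a pair of runs where every OTP 17.4 timing is 12.6% above the R15B timing yields a mean slowdown of 12.6 with zero spread.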

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions.

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions.


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions.


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions.

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers
0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


[Plot: execution time (s) against number of nodes (0–250); series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 24: TL-ACO execution times, Athos cluster.

[Plot: execution time (s) against number of nodes (0–250); series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 25: ML-ACO execution times, Athos cluster.


[Plot: execution time (s) against number of nodes (0–250); series: R15B, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 26: GR-ACO execution times, Athos cluster.

[Plot: execution time (s) against number of nodes (0–250); series: TL-ACO, ML-ACO, GR-ACO.]

Figure 27: R15B execution times, messages x 500.


[Plot: execution time (s) against number of nodes (0–250); series: TL-ACO, ML-ACO, GR-ACO.]

Figure 28: OTP 17.4 execution times, messages x 500.

[Plot: execution time (s) against number of nodes (0–250); series: TL-ACO, ML-ACO, GR-ACO, SR-ACO.]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages x 500.


[Plot: execution time (s) against number of nodes (0–250); series: TL-ACO, ML-ACO, GR-ACO.]

Figure 30: R15B execution times (2), Athos cluster.

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in Section 3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of Section 3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141144181-184189-198235-286289-306325-347353-360363-366378387-396467-468541-549577-592595-598602611-648665667-684701-726729734-735771-776]

whereas the allocation for Figure 23 was

atcn[055-072109-144199-216235-252271-306325-342433-450458465-467505-522541-594667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and this means that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


[Plot: execution time (s) against number of nodes (0–250); series: TL-ACO, ML-ACO, GR-ACO.]

Figure 31: OTP 17.4 execution times (2), Athos cluster.

[Plot: execution time (s) against number of nodes (0–250); series: TL-ACO, ML-ACO, GR-ACO, SR-ACO.]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster.


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
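SLURM writes such allocations in a compact bracketed hostlist notation of comma-separated node ranges. As a rough sketch (the helper names are ours, not SLURM's own tooling), an allocation string can be expanded and its fragmentation quantified like this:

```python
import re

def expand_hostlist(expr):
    """Expand a SLURM-style hostlist such as 'atcn[055-072,109-144]'
    into individual host names (preserving zero-padding)."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", expr)
    if not m:
        return [expr]  # a bare host name expands to itself
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)
            hosts += [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]
        else:
            hosts.append(prefix + part)
    return hosts

def fragments(expr):
    """Number of contiguous ranges: a rough measure of fragmentation."""
    m = re.fullmatch(r"\w+\[([\d,-]+)\]", expr)
    return len(m.group(1).split(",")) if m else 1
```

A busy-cluster allocation like the one for Figure 32 has many more fragments than the quiet-weekend allocation for Figure 23, which is one way to compare the two quantitatively.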

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running on Erlang/OTP 17.4 than Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this reaches only 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
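The speedup and efficiency figures follow directly from the measured runtimes. A minimal sketch, using illustrative runtimes close to the reported ~1000-minute single-node and sub-300-minute 16-node runs (not the actual data set):

```python
def relative_speedup(t_base, t_n):
    """Speedup of an n-node run relative to the baseline configuration."""
    return t_base / t_n

def relative_efficiency(t_base, nodes_base, t_n, nodes_n):
    """Speedup scaled by the added hardware; 1.0 would be ideal scaling."""
    return relative_speedup(t_base, t_n) * nodes_base / nodes_n

# Illustrative runtimes in minutes (assumed, for demonstration only):
t1, t16 = 1000.0, 290.0
print(round(relative_speedup(t1, t16), 2))            # speedup on 16 nodes
print(round(relative_efficiency(t1, 1, t16, 16), 2))  # efficiency vs. ideal 16x
```

An efficiency well below 1.0 at 16 nodes quantifies the underutilisation of the distributed hardware discussed above.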

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.
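The top figures translate into absolute usage as follows. This is a sketch; the 32 logical cores and 64GB total RAM are the GPG host parameters stated above, and the helper names are ours:

```python
def cores_busy(top_cpu_percent):
    """top reports CPU as a multiple of one core: 2200% means 22 busy cores."""
    return top_cpu_percent / 100.0

def memory_gb(mem_percent, total_gb=64):
    """Convert top's %MEM figure into gigabytes on a host with total_gb RAM."""
    return mem_percent / 100.0 * total_gb

print(cores_busy(2200))  # logical cores in use at peak
print(memory_gb(14))     # peak memory consumption in GB
```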

Figure 39 shows the network traffic, i.e. the number of sent and received packets, between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca, we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution, Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution, Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of naming the node: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
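The naming scheme in that example can be expressed as a small function. This is an illustrative Python sketch, not project code; the underscore and "@" separators are assumptions based on Erlang node-name conventions, since the exact punctuation is hard to read in the example above.

```python
def computing_node_name(simulation: str, user: str, host: str) -> str:
    """Sketch of the computing-node naming scheme described in the text,
    e.g. soda_benchmarking_test -> Soda_Benchmarking_Test."""
    # Capitalise each underscore-separated token of the simulation name.
    cased = "_".join(token.capitalize() for token in simulation.split("_"))
    return f"Sim-Diasca_{cased}-{user}@{host}"
```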

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodesare started with those names

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge about these applications and the scalability issues they experience, interpretations were made, preparing the removal of the next bottlenecks to be encountered and promoting some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
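The effect of this workaround can be modelled with a non-blocking file descriptor: instead of issuing a blocking call, the operation is retried in a spin loop until it succeeds. This is a Python model for illustration; the actual change lives in the C code of the Erlang runtime.

```python
import os

def spin_read(fd, n):
    """Model of the CNK workaround: busy-wait on a non-blocking
    descriptor instead of issuing a blocking read(), since on CNK a
    blocking read could deadlock the node."""
    while True:
        try:
            return os.read(fd, n)
        except BlockingIOError:
            continue  # no data yet: spin and retry
```

The cost of this approach is CPU time burnt while spinning, which is accepted on CNK in exchange for avoiding the deadlocks.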

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address, that of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved to be desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
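The name construction performed by startup, basename ++ MPI index ++ hostname, can be sketched as follows. This is a Python illustration, not the mpihelper code; the "@" separator is an assumption based on standard Erlang node-name conventions, and the example host name is hypothetical.

```python
def mpi_node_name(basename: str, mpi_index: int, hostname: str) -> str:
    """Sketch: build a distributed-Erlang-style node name from the MPI
    rank, so every node can derive every other node's name locally."""
    return f"{basename}{mpi_index}@{hostname}"
```

Because each node knows the MPI world size and its own index, the full set of node names can be computed without any central registry, which is exactly what replaces epmd here.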

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map the Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
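The command-byte dispatch performed by output and control can be sketched as follows. This is a Python model of the mechanism only; the real driver is C code, and the command numbering shown is hypothetical.

```python
def handle_output(buf: bytes, handlers: dict):
    """Model of the driver's output entry point: the first byte of the
    buffer selects the command, and the remainder is relayed to the
    selected functionality as its payload."""
    command, payload = buf[0], buf[1:]
    return handlers[command](payload)

# Hypothetical command table mirroring the commands listed above.
HANDLERS = {
    0: lambda p: ("listen", p),
    1: lambda p: ("accept", p),
    2: lambda p: ("connect", p),
    3: lambda p: ("send", p),
    4: lambda p: ("receive", p),
}
```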

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP [10]

and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

[10] For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions [execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE)]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded with two processing units per core

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes similar to those in our distributedexperiments The large experiments use large numbers of concurrent processes to make sure that theErlang VM is fully exercised
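The two series of ant counts can be generated mechanically; the following Python sketch makes the parameter spacing explicit (our helper names, for illustration only):

```python
def small_counts():
    # Small series: 1, then 10 to 1000 in steps of 10.
    return [1] + list(range(10, 1001, 10))

def large_counts():
    # Large series: 1, then 500 to 100000 in steps of 500.
    return [1] + list(range(500, 100001, 500))
```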

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions [execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE)]

Figure 48: EDF Xeon machines, small executions, with the erts +Muacul0 flag set [execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE)]


Figure 49: EDF Xeon machines, large executions, with the erts +Muacul0 flag set [execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE)]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.
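The slowdown figures quoted in this appendix are relative overheads of one version's execution time over another's, which can be computed as follows (illustrative helper, not part of the benchmark harness):

```python
def overhead_pct(t_new: float, t_old: float) -> float:
    """Relative slowdown of a new version over an old one, in percent:
    12.6% means the new version's execution time is about 1.126x the
    old one's."""
    return 100.0 * (t_new - t_old) / t_old
```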

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to restore the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


Figure 50: Glasgow Xeon machines, small executions [execution time (s) against number of ants (1, 10, 20, 30, ..., 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE)]

Figure 51: Glasgow Xeon machines, large executions [execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE)]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on the 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).


Figure 54: Heriot-Watt AMD machine, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(3):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 26: GR-ACO execution times, Athos cluster. Execution time (s) against number of nodes, for R15B, OTP 17.4 (official) and OTP 17.4 (RELEASE).

Figure 27: R15B execution times, messages × 500. Execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.


Figure 28: OTP 17.4 execution times, messages × 500. Execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.

Figure 29: OTP 17.4 (RELEASE version) execution times, messages × 500. Execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.


Figure 30: R15B execution times (2), Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.

which illustrate a phenomenon which has caused us some difficulty.

The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely: this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in Section 3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of Section 3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO and GR-ACO.

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. Execution time (s) against number of nodes, for TL-ACO, ML-ACO, GR-ACO and SR-ACO.


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
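The degree of fragmentation in allocations like those above can be quantified by expanding the SLURM nodelist and counting its contiguous blocks (in practice `scontrol show hostnames` performs the expansion). The sketch below is a simplified, hypothetical parser for the single-prefix `atcn[...]` form only, not SLURM's full hostlist syntax, and the example nodelists are shortened, illustrative versions of the real allocations:

```python
import re

def expand_nodelist(spec):
    """Expand a simple SLURM nodelist like 'atcn[141,144,181-184]' into
    host names. Handles only the single-prefix bracketed form above."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", spec)
    prefix, body = m.group(1), m.group(2)
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding, e.g. 055 -> atcn055
            hosts += ["%s%0*d" % (prefix, width, n)
                      for n in range(int(lo), int(hi) + 1)]
        else:
            hosts.append(prefix + part)
    return hosts

def fragments(spec):
    """Number of contiguous blocks in the allocation."""
    return spec.count(",") + 1

fragmented = "atcn[141,144,181-184,189-198,235-286,289-306,325-347]"
compact = "atcn[055-072,109-144,199-216]"
print(fragments(fragmented), len(expand_nodelist(fragmented)))
print(fragments(compact), len(expand_nodelist(compact)))
```

The count of blocks, rather than of hosts, is what matters here: a fragmented allocation scatters the same number of hosts across many more distinct regions of the cluster.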

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP 17.4 than Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5 2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new 'small' scale of the City-example case, i.e. the second version of the 'small' scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this is only 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
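The speedup and efficiency figures quoted above follow directly from the measured runtimes. A minimal sketch, using approximate runtimes read off Figure 34 (assumed values, not exact measurements):

```python
def speedup(t_base, t_n):
    """Relative speedup of an n-node run against the single-node run."""
    return t_base / t_n

def efficiency(t_base, t_n, n_nodes):
    """Fraction of ideal linear (per-node) speedup achieved."""
    return speedup(t_base, t_n) / n_nodes

# Approximate runtimes (minutes) of the City case study, read off Figure 34.
runtimes = {1: 1000, 2: 667, 4: 455, 16: 290}

for n in (2, 4, 16):
    s = speedup(runtimes[1], runtimes[n])
    e = efficiency(runtimes[1], runtimes[n], n)
    print("%2d nodes: speedup %.2f, efficiency %.0f%%" % (n, s, 100 * e))
```

With these assumed runtimes the sketch reproduces the quoted speedups of roughly 1.5, 2.2 and 3.45, and makes the efficiency decline (about 75%, 55% and 22% of ideal) explicit.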

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here: using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster, during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was to move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where NUMA effects start becoming visible.
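This plateau is the shape Amdahl's law predicts when a substantial serial fraction remains. The sketch below uses an illustrative 20% serial fraction (an assumption; Sim-Diasca's actual fraction was not measured) to show how the attainable speedup flattens well before 64 schedulers. Note that Amdahl's law explains only the plateau, not the subsequent deterioration, which is due to hyperthreading and NUMA effects:

```python
def amdahl(serial_fraction, n):
    """Upper bound on speedup with n schedulers, per Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# Illustrative serial fraction only; not a measured value for Sim-Diasca.
f = 0.20
for n in (1, 2, 4, 8, 16, 32, 64):
    print("%2d schedulers: at most %.2fx" % (n, amdahl(f, n)))
```

With f = 0.20 the bound is 2.5x at 4 schedulers and only 3.33x at 8, already close to the asymptotic limit of 1/f = 5x, which matches the "significant gain up to four schedulers, marginal gain at eight" pattern observed above.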

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.
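The synchronisation scheme just described is essentially a per-node barrier. The toy model below (hypothetical code, not Sim-Diasca's actual class_TimeManager implementation) shows why a single slow model instance holds back the diasca for the whole node:

```python
class TimeManager:
    """Toy per-node time manager: the current diasca can only advance
    once every locally scheduled actor has reported back."""

    def __init__(self, actors):
        self.actors = set(actors)
        self.pending = set(actors)  # actors yet to report this diasca
        self.diasca = 0

    def report_done(self, actor):
        """An actor reports that its behaviour evaluation is over."""
        self.pending.discard(actor)
        if not self.pending:  # all local actors reported: advance diasca
            self.diasca += 1
            self.pending = set(self.actors)

tm = TimeManager({"a1", "a2", "a3"})
tm.report_done("a1")
tm.report_done("a2")
print(tm.diasca)  # still 0: a3 has not reported yet
tm.report_done("a3")
print(tm.diasca)  # 1: the barrier is released
```

Until the last actor reports, every other actor on the node sits idle at the barrier, which is consistent with the low running times observed in Figure 43.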

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.
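The per-process figure Percept2 reports here is, in essence, the fraction of a process's lifetime covered by its execution intervals (the green segments in Figure 43). A minimal sketch, with hypothetical intervals for illustration only:

```python
def running_fraction(lifetime, running_intervals):
    """Fraction of a process's lifetime spent actually executing,
    given (start, end) execution intervals within that lifetime."""
    start, end = lifetime
    busy = sum(e - s for s, e in running_intervals)
    return busy / (end - start)

# Hypothetical process: alive for 5 s, executing in three short bursts.
lifetime = (0.0, 5.0)
bursts = [(0.10, 0.18), (1.40, 1.52), (3.00, 3.05)]
print("running %.1f%% of lifetime" % (100 * running_fraction(lifetime, bursts)))
```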


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution, Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM first to deploy the number of computing nodes that he wants to use for a simulation, and then to deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution, Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency: usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after it, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of one simulation's execution affecting the next.)

• Interfacing: our load testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that also makes it easier to use other deployment methods, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries that make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes it received as a parameter are already running, so that they do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to share the same cookie).
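To make the interplay of these two options concrete, the decision they drive can be sketched as a small pure function. This is an illustrative sketch only: the module, function name and return values are hypothetical and not the actual Sim-Diasca code; only the option names come from the text above.

```erlang
-module(deploy_options).
-export([plan/1]).

%% Hypothetical sketch: decide, from the configuration entries described
%% above, whether the deployment manager should deploy the computing
%% nodes itself (the default) or assume they are already running.
plan(Options) ->
    case proplists:get_bool(start_nodes, Options) of
        true ->
            %% start_nodes set: the listed computing nodes are already
            %% running (e.g. deployed by WombatOAM), so skip deployment
            %% and use the cookie shared by all those nodes.
            Cookie = proplists:get_value(use_cookies, Options),
            {skip_deployment, {use_cookie, Cookie}};
        false ->
            %% Default Sim-Diasca behaviour: deploy the nodes and
            %% generate a random Erlang cookie on the user node.
            {deploy_nodes, random_cookie}
    end.
```

For example, plan([start_nodes, {use_cookies, wombat_cookie}]) yields {skip_deployment, {use_cookie, wombat_cookie}}, mirroring the WombatOAM-driven workflow, while plan([]) falls back to the default deployment path.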

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let instances acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method as close as possible to the usual WombatOAM deployments; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, the node should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have certain names, and the name of the simulation must be passed to the script that starts a computing node. As an example of the naming scheme: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 by the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
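The naming rule in this example can be captured by a small pure function. This is an illustrative reconstruction of the described format, not the code Sim-Diasca or WombatOAM actually uses:

```erlang
-module(node_naming).
-export([computing_node_name/3]).

%% Illustrative sketch: build the node name Sim-Diasca expects from the
%% simulation name, the user name and the host, capitalizing each
%% '_'-separated word of the simulation name.
computing_node_name(Simulation, User, Host) ->
    Words = string:split(Simulation, "_", all),
    Camel = lists:join("_", [capitalize(W) || W <- Words]),
    lists:flatten(["Sim-Diasca_", Camel, "-", User, "@", Host]).

capitalize([First | Rest]) -> [string:to_upper(First) | Rest].
```

With the inputs of the example above, computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1") returns "Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1".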

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

More specifically, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
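The groups this design calls for can be derived mechanically from the time-manager tree. The sketch below is a hypothetical pure helper, not part of Sim-Diasca: it computes one s_group per non-leaf manager, containing that manager and its direct children; the resulting list could then be handed to SD Erlang's s_group:new_s_group/2 for creation.

```erlang
-module(tm_groups).
-export([groups/1]).

%% A time-manager tree is {Node, [ChildTree]}. Each non-leaf manager
%% yields one s_group containing itself and its direct children, so
%% every manager belongs to its parent's group and (if a parent itself)
%% to the group of its own children.
groups({_Leaf, []}) ->
    [];
groups({Node, Children}) ->
    Group = {group_name(Node), [Node | [Child || {Child, _} <- Children]]},
    [Group | lists:append([groups(C) || C <- Children])].

group_name(Node) ->
    list_to_atom("tm_group_" ++ atom_to_list(Node)).
```

For the default tree of height one, with one root and n non-root managers, this yields a single s_group containing all time managers, as in Figure 44.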


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we were able to prepare for the removal of the next bottlenecks to be encountered, and to promote some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which must also be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can access the TCP/IP stack at a time, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.
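For context, a module selected with -proto_dist (such as mpi_dist) is expected to export the same callbacks as the standard inet_tcp_dist module; the convention is described in the ERTS documentation on alternative distribution carriers. A minimal skeleton of that callback surface might look as follows (illustrative only, not the actual mpi_dist source):

```erlang
-module(dist_skeleton).
-export([listen/1, accept/1, accept_connection/5,
         setup/5, close/1, select/1]).

%% select/1: decide whether this module is willing to handle a
%% connection to Node; here we accept any well-formed Name@Host name.
select(Node) when is_atom(Node) ->
    case string:find(atom_to_list(Node), "@") of
        nomatch -> false;
        _ -> true
    end.

%% The remaining callbacks open a listening port, accept and set up
%% connections, and tear them down; the real mpi_dist implements them
%% on top of the C port program described in Section A.2.
listen(_Name) -> {error, not_implemented}.
accept(_Listen) -> erlang:error(not_implemented).
accept_connection(_Accept, _Socket, _MyNode, _Allowed, _SetupTime) ->
    erlang:error(not_implemented).
setup(_Node, _Type, _MyNode, _LongOrShortNames, _SetupTime) ->
    erlang:error(not_implemented).
close(_Listen) -> ok.
```

Only select/1 is given a body here; the real distribution modules perform stricter checks (e.g. host name resolution) in their callbacks.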

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
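The name construction performed by startup can be illustrated as follows; since the text does not give the exact separators, the '@'-joined format below is an assumption, chosen to match Erlang's usual Name@Host node names:

```erlang
-module(mpi_naming).
-export([node_name/3]).

%% Illustrative sketch: build a distributed-node name from the base name
%% (mpinode by default), the node's MPI rank and its hostname, in the
%% "basename ++ MPI index ++ hostname" shape described above.
node_name(Base, MpiIndex, Host) ->
    list_to_atom(Base ++ integer_to_list(MpiIndex) ++ "@" ++ Host).
```

So MPI rank 3 on a host named cn17 (a made-up hostname) would be named mpinode3@cn17.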

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

• Machines in EDF's ATHOS cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
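For reference, the two series of ant counts can be generated directly from the lists above (a trivial sketch):

```erlang
-module(ant_counts).
-export([small/0, large/0]).

%% Small series: 1, then 10 to 1000 in steps of 10.
small() -> [1 | lists:seq(10, 1000, 10)].

%% Large series: 1, then 500 to 100000 in steps of 500.
large() -> [1 | lists:seq(500, 100000, 500)].
```

The small series has 101 points and the large one 201, matching the x-axes of Figures 46 and 47.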

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), same five Erlang/OTP versions as Figure 46.]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), same five Erlang/OTP versions as Figure 47.]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Change Log

Version Date Comments

0.1 31.01.2015 First version, submitted to internal reviewers

0.2 23.03.2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27.03.2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis Nikolaos Papaspyrou Katerina Roukounaki Konstantinos Sagonas Yian-nis Tsiouris and Ioannis E Venetis A scalability benchmark suite for ErlangOTP InProceedings of the Eleventh ACM SIGPLAN Workshop on Erlang pages 33ndash42 ACM 2012

[Bas14] Basho Riak 2014

[BBHS99] A Bauer B Bullnheimer RF Hartl and C Strauss An ant colony optimization approachfor the single machine total tardiness problem In Evolutionary Computation 1999 CEC99 Proceedings of the 1999 Congress on volume 2 1999

[CLTG14] N Chechina H Li P Trinder and A Ghaffari Scalable SD Erlang computation modelTechnical Report TR-2014-003 The University of Glasgow December 2014

[dBSD00] Matthijs den Besten Thomas Stutzle and Marco Dorigo Ant colony optimization for thetotal weighted tardiness problem In Marc Schoenauer Kalyanmoy Deb Gunther RudolphXin Yao Evelyne Lutton JuanJulian Merelo and Hans-Paul Schwefel editors ParallelProblem Solving from Nature PPSN VI volume 1917 of Lecture Notes in Computer Sciencepages 611ndash620 Springer Berlin Heidelberg 2000

ICT-287510 (RELEASE) 23rd December 2015 66

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9–10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


ICT-287510 (RELEASE) 23rd December 2015 35

Figure 28: OTP 17.4 execution times, messages × 500. [Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

Figure 29: OTP 17.4 (RELEASE version) execution times, messages × 500. [Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]


Figure 30: R15B execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in Section 3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of Section 3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO and GR-ACO.]

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. [Plot: execution time (s) against number of nodes (0–250) for TL-ACO, ML-ACO, GR-ACO and SR-ACO.]


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
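The degree of fragmentation is visible directly in the SLURM allocation strings above: each comma-separated block in the bracketed expression is one contiguous run of nodes. A minimal sketch of ours (not part of the deliverable's tooling; `expand_hostlist` and `fragments` are hypothetical helper names):

```python
import re

def expand_hostlist(spec: str) -> list[str]:
    """Expand a SLURM-style hostlist such as 'atcn[141,144,181-184]'
    into individual host names, preserving zero padding."""
    prefix, body = re.fullmatch(r"([A-Za-z]+)\[([-\d,]+)\]", spec).groups()
    hosts = []
    for item in body.split(","):
        lo, _, hi = item.partition("-")
        width = len(lo)
        for i in range(int(lo), int(hi or lo) + 1):
            hosts.append(f"{prefix}{i:0{width}d}")
    return hosts

def fragments(spec: str) -> int:
    """Number of contiguous blocks in the allocation: a crude proxy
    for how scattered the compute nodes are across the cluster."""
    return spec.count(",") + 1

print(expand_hostlist("atcn[141,144,181-184]"))
# ['atcn141', 'atcn144', 'atcn181', 'atcn182', 'atcn183', 'atcn184']
```

Counted this way, the busy-cluster allocation used for Figure 32 splits into roughly twice as many contiguous blocks as the lightly-loaded weekend allocation used for Figure 23, consistent with the wider spread of nodes described above.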

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, SD Erlang versions consistently scale better than distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP 17.4 than in Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34 Runtimes of the Sim-Diasca City Case Study GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new "small" scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver and are hence more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurements to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35 Speedup of the Sim-Diasca City Case Study GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls away to just 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of only 3.45 on 16 nodes (256 cores).
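The speedup and efficiency figures follow from the measured runtimes by the usual definitions; a quick check with approximate values read off Figure 34 (illustrative numbers of ours, not exact measurements):

```python
def speedup(t_base: float, t_n: float) -> float:
    """Speedup relative to the single-node (16-core) runtime."""
    return t_base / t_n

def efficiency(t_base: float, t_n: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved over `nodes` nodes."""
    return speedup(t_base, t_n) / nodes

t_1 = 1000.0   # ~1000 minutes on 1 node (16 cores)
t_16 = 290.0   # just under 300 minutes on 16 nodes (256 cores)

print(round(speedup(t_1, t_16), 2))         # ~3.45
print(round(efficiency(t_1, t_16, 16), 2))  # ~0.22: most of the 256 cores sit idle
```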

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
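Assuming a standard Erlang release layout, the two settings above can be captured in the VM arguments; +S Schedulers:SchedulersOnline is the stock Erlang/OTP flag for capping online schedulers (the counts below are a sketch for a 12-core, 24-hyperthread Athos host, not the deliverable's actual configuration files):

```text
# vm.args-style fragment (sketch)
+sbt tnnps     # thread_no_node_processor_spread scheduler binding
+S 12:12       # 12 schedulers, all online: ignore the 12 hyperthreaded cores
```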

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.
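Converting top's cumulative percentages into absolute figures is simple arithmetic (the 64GB figure is the per-host RAM stated above; the variable names are ours):

```python
LOGICAL_CORES = 32    # 16 physical cores, hyperthreaded
RAM_GB = 64.0         # per GPG host

cpu_percent = 2200.0  # top sums %CPU over all threads, so >100% is expected
mem_percent = 14.0    # top's %MEM for the node's processes

cores_busy = cpu_percent / 100        # 22.0 of the 32 logical cores
mem_gb = RAM_GB * mem_percent / 100   # 8.96 GB of the 64 GB

print(cores_busy, round(mem_gb, 2))
```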

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The number of sent and received packets are roughly the same, and


Figure 36 Runtime Impact of the Number of Schedulers on Single-host Performance Athos Cluster

Figure 37 Core Usage of the Sim-Diasca City Case Study GPG Cluster


Figure 38 Memory Usage of the Sim-Diasca City Case Study GPG Cluster

Figure 39 Network Traffic of the Sim-Diasca City Case Study GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows. To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and in the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores of the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.
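This plateau-then-degrade shape is what a sizeable serial fraction predicts. As a purely illustrative model (the 75% parallel fraction below is our assumption, not a measurement from this study), Amdahl's law caps the achievable speedup regardless of scheduler count:

```python
def amdahl_speedup(n_schedulers: int, parallel_fraction: float) -> float:
    """Ideal speedup on n schedulers when only `parallel_fraction`
    of the work can run in parallel (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_schedulers)

# With 75% of the work parallelisable, four schedulers already deliver
# most of the attainable speedup, and 64 schedulers barely improve on it:
for n in (1, 4, 8, 64):
    print(n, round(amdahl_speedup(n, 0.75), 2))
```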

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so that it would not capture information unrelated to the actual simulation, such as its setup) nor too late (so that it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file per computing node, which we can later analyse and visualise using Percept2.

Once we had added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we expected that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca under Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a much smaller file (approximately 85MB) that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment as it worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy the computing nodes first and the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in Sim-Diasca, WombatOAM, or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after it, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned, but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the nodes, but should start them by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming scheme: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.


Figure 44: The hierarchical structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and identified some design patterns and good practices to follow regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpi_node by default) and builds a name basename ++ MPI_index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for the same five Erlang/OTP versions.]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for the same five Erlang/OTP versions.]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for the same five Erlang/OTP versions.]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


Figure 50: Glasgow Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for the same five Erlang/OTP versions.]

Figure 51: Glasgow Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for the same five Erlang/OTP versions.]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for the same five Erlang/OTP versions.]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for the same five Erlang/OTP versions.]

Change Log

Version  Date        Comments

0.1      31/01/2015  First version, submitted to internal reviewers

0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


ICT-287510 (RELEASE) 23rd December 2015 36


Figure 30: R15B execution times (2), Athos cluster

which illustrate a phenomenon which has caused us some difficulty. The results are shown in Figures 30–32. The reader will observe that there are large discontinuities in

the graphs, with performance undergoing sudden, drastic degradations as the number of nodes increases. (The graph for SR-ACO in Figure 32 ends somewhat prematurely; this is because we tested all four ACO versions in a single SLURM job and exceeded our eight-hour time limit. In other tests we reduced the number of repetitions of each test slightly to avoid this.)

We attribute this behaviour to the fact that the Athos cluster was undergoing heavy usage at the time of the experiments illustrated here, but was much more lightly loaded when the experiments in Section 3.3.1 were performed (on a Saturday evening/Sunday morning). We suspect that the Athos cluster has a non-uniform (and probably hierarchical) communication topology. We believe that there are at least two factors in play here.

Fragmentation of SLURM node allocations. When the system is busy, SLURM allocations (see the start of Section 3) are much more fragmented. For example, the allocation for the experiments in Figure 32 was

atcn[141,144,181-184,189-198,235-286,289-306,325-347,353-360,363-366,378,387-396,467-468,541-549,577-592,595-598,602,611-648,665,667-684,701-726,729,734-735,771-776]

whereas the allocation for Figure 23 was

atcn[055-072,109-144,199-216,235-252,271-306,325-342,433-450,458,465-467,505-522,541-594,667-684]

The highly-fragmented nature of the former allocation means that our compute nodes were (probably) much more spread out across the cluster, and that at certain points including a new machine would mean that it was in a more "distant" region of the cluster in terms of communication,


Figure 31: OTP 17.4 execution times (2), Athos cluster


Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.
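The degree of fragmentation is visible directly in the compressed nodelist strings quoted above. As a rough illustration (a sketch, not project code; it assumes the standard SLURM hostlist syntax), one can expand an allocation and count its contiguous blocks:

```python
import re

def expand_nodelist(nodelist: str):
    """Expand a SLURM-style compressed nodelist, e.g. 'atcn[141,144,181-184]',
    into an explicit list of host names."""
    m = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", nodelist)
    prefix, spec = m.group(1), m.group(2)
    indices = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            indices.append(int(part))
    return [f"{prefix}{i:03d}" for i in indices]

def contiguous_blocks(nodelist: str) -> int:
    """Count maximal runs of consecutive node indices: a rough fragmentation metric."""
    idx = sorted(int(re.search(r"(\d+)$", h).group(1)) for h in expand_nodelist(nodelist))
    return 1 + sum(1 for a, b in zip(idx, idx[1:]) if b != a + 1)

# A busy-cluster allocation splits into many more blocks than a quiet one:
print(contiguous_blocks("atcn[141,144,181-184,189-198]"))  # 4 blocks
print(contiguous_blocks("atcn[055-072,109-144]"))          # 2 blocks
```

A higher block count for the same number of hosts means nodes scattered more widely across the machine, and hence more "distant" communication.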

Network traffic congestion. Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.
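The per-interface packet counters that netstat reports can equally be sampled from /proc/net/dev before and after a run; a minimal parsing sketch (the sample text is invented for illustration, the field layout is the standard /proc/net/dev one):

```python
def parse_proc_net_dev(text: str):
    """Parse /proc/net/dev-style output into {iface: (rx_packets, tx_packets)}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        iface, data = line.split(":", 1)
        fields = data.split()
        # Layout after the colon: 8 receive columns (bytes, packets, ...),
        # then 8 transmit columns (bytes, packets, ...).
        counters[iface.strip()] = (int(fields[1]), int(fields[9]))
    return counters

sample = """Inter-|   Receive                |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
  eth0: 1000 150 0 0 0 0 0 0 2000 120 0 0 0 0 0 0
"""
rx, tx = parse_proc_net_dev(sample)["eth0"]
print(rx, tx)  # 150 120
```

Sampling these counters on each node at the start and end of a run, and subtracting, gives per-node sent/received packet totals of the kind plotted in Figure 33.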

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP 17.4 than in Erlang/OTP R15B (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver and are more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurements to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1,000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this rises to only 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
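The quoted speedups follow directly from the measured runtimes; a quick check, using approximate runtimes read off Figure 34 (the 2- and 4-node values are illustrative readings consistent with the reported speedups):

```python
def speedup(t_base: float, t_n: float) -> float:
    """Relative speedup against the single-node baseline runtime."""
    return t_base / t_n

def efficiency(t_base: float, t_n: float, scale: int) -> float:
    """Speedup divided by the hardware scaling factor: 1.0 is ideal."""
    return speedup(t_base, t_n) / scale

# Approximate runtimes (minutes) for 1, 2, 4 and 16 GPG nodes.
t1, t2, t4, t16 = 1000.0, 667.0, 455.0, 290.0
print(round(speedup(t1, t2), 2))          # ~1.5 on 2 nodes
print(round(speedup(t1, t4), 2))          # ~2.2 on 4 nodes
print(round(speedup(t1, t16), 2))         # ~3.45 on 16 nodes
print(round(efficiency(t1, t16, 16), 2))  # ~0.22: poor hardware utilisation
```

The parallel efficiency of roughly 0.22 at 16 nodes is what "does not efficiently utilise the distributed hardware" amounts to numerically.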

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that the default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 logical cores available) and 14% (8.96GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems when running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster, during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.
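This saturation pattern is what Amdahl's law predicts when only part of the work parallelises; a quick illustration (the parallel fraction p = 0.8 is an assumption chosen for illustration, not a measured value):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup on n schedulers when a fraction p of the work
    is parallelisable and the remaining (1 - p) is serial."""
    return 1.0 / ((1.0 - p) + p / n)

for n in (1, 4, 8, 64):
    print(n, round(amdahl_speedup(0.8, n), 2))
# With p = 0.8 the speedup is capped at 5 however many schedulers are added,
# matching the "significant gain up to four schedulers, marginal after" shape.
```

Any NUMA or hyperthreading penalty then sits on top of this cap, turning the plateau into the observed deterioration.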

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so that it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so that it would not miss any information from the simulation execution).

In order to achieve this we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability

• information about messages sent and received


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB) and that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.
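A schematic model of this barrier (illustrative only, not Sim-Diasca code) shows how a single slow model instance inflates the idle time of every other actor scheduled in the same diasca:

```python
def diasca_step(actor_eval_times):
    """Return (step duration, mean idle fraction) for one diasca: the time
    manager cannot advance until every scheduled actor has reported back."""
    step = max(actor_eval_times)  # the manager waits for the slowest actor
    idle = sum(step - t for t in actor_eval_times) / (step * len(actor_eval_times))
    return step, idle

# Nine cheap actors plus one complex model instance:
step, idle = diasca_step([1.0] * 9 + [10.0])
print(step)            # 10.0: the slowest actor bounds the diasca
print(round(idle, 2))  # 0.81: most actor-time in this diasca is spent waiting
```

Even in this toy setting, over 80% of the aggregate actor time is idle, which is consistent with the kind of profile Percept2 reported.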

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy the computing nodes first, and the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned, but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,             % Node family of the computing nodes
>     soda_benchmarking_test).    % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.
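The kind of name derivation involved can be pictured as follows (illustrative only: the exact naming scheme and separators are defined by Sim-Diasca, and this helper is a hypothetical sketch of the example name shown above):

```python
def computing_node_name(simulation: str, user: str, host: str) -> str:
    """Sketch of deriving a computing-node name from the simulation name,
    the user and the host; the real separators are Sim-Diasca's own."""
    camel = "_".join(word.capitalize() for word in simulation.split("_"))
    return f"Sim-Diasca_{camel}-{user}@{host}"

print(computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1"))
# Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1
```

Because both the deployer and Sim-Diasca must agree on this derivation, renaming nodes by hand was needed before node name base templates existed.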

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which a node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started in the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
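As a concrete illustration, the grouping described above could be set up with the SD Erlang s_group API roughly as follows. This is only a sketch: the node names, group names, and tree shape are hypothetical, and s_group:new_s_group/2 is assumed to be available as in the RELEASE SD Erlang libraries.

```erlang
%% Sketch: a two-level hierarchy of Time Manager s_groups.
%% All node and group names below are hypothetical.
-module(tm_s_groups).
-export([setup/0]).

setup() ->
    %% The root time manager and its direct children share one s_group.
    {ok, _, _} = s_group:new_s_group(root_tm_group,
                                     ['root_tm@host0',
                                      'tm1@host1', 'tm2@host2']),
    %% Each non-root time manager also forms an s_group with its own
    %% children, so unrelated nodes are never directly connected.
    {ok, _, _} = s_group:new_s_group(tm1_group,
                                     ['tm1@host1',
                                      'tm3@host3', 'tm4@host4']),
    ok.
```

Gateway processes registered inside each s_group would then relay messages between adjacent groups, as in the Multi-level ACO design.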

ICT-287510 (RELEASE) 23rd December 2015 54

Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

The knowledge gained about these applications and the scalability issues they experience allowed us to anticipate the removal of the next bottlenecks to be encountered, and to promote design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (these are the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, one instead needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally,

mpi_helper:get_world_size() returns the total number of nodes, and

mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
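The node-name construction that startup performs can be pictured as follows (a sketch only: the actual mpi_helper source is not shown in this deliverable, and the "@" separator is an assumption based on the usual Erlang node-name format):

```erlang
%% Sketch of the name construction described above:
%% basename ++ MPI index ++ hostname, handed to net_kernel.
-module(mpi_helper_sketch).
-export([node_name/3]).

%% node_name("mpinode", 3, "host7") gives 'mpinode3@host7'.
node_name(BaseName, MpiIndex, HostName) ->
    list_to_atom(BaseName ++ integer_to_list(MpiIndex)
                          ++ "@" ++ HostName).
```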

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement for TCP/IP in the networking layer of Erlang/OTP,10

and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. (Execution time in seconds against the number of ants, 1-1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).)

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
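The two experiment grids, and the averaging over repeated runs described below, can be sketched as follows (illustrative Erlang; the module is hypothetical and merely restates the parameters above):

```erlang
%% Ant counts for the two experiment sizes, and the mean over the
%% repeated runs recorded for each count.
-module(aco_params).
-export([small_counts/0, large_counts/0, mean/1]).

small_counts() -> [1 | lists:seq(10, 1000, 10)].     % 1,10,20,...,1000
large_counts() -> [1 | lists:seq(500, 100000, 500)]. % 1,500,...,100000

mean(Times) -> lists:sum(Times) / length(Times).
```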

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. (Execution time in seconds against the number of ants, 1-100000, for the same five versions.)

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set.


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set.

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


Figure 50: Glasgow Xeon machines, small executions.

Figure 51: Glasgow Xeon machines, large executions.


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions.


Figure 54: Heriot-Watt AMD machine, large executions.

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 31: OTP 17.4 execution times (2), Athos cluster. (Execution time in seconds against the number of nodes for TL-ACO, ML-ACO and GR-ACO.)

Figure 32: OTP 17.4 (RELEASE version) execution times (2), Athos cluster. (Execution time in seconds against the number of nodes for TL-ACO, ML-ACO, GR-ACO and SR-ACO.)


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total number of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing the Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups, and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16 core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.
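At the VM level this corresponds to the schedulers_online setting, which can also be adjusted at runtime via erlang:system_flag/2. A minimal sketch, assuming (as on the Athos nodes) that hyperthreading doubles the logical core count:

```erlang
%% Sketch: restrict a running Erlang VM to its physical cores only, under
%% the assumption of two hardware threads per core (as on the Athos hosts).
%% The scheduler binding itself is set on the command line, e.g. erl +sbt tnnps.
-module(sched_tuning).
-export([use_physical_cores_only/0]).

use_physical_cores_only() ->
    Logical  = erlang:system_info(schedulers),        % schedulers created at boot
    Physical = max(1, Logical div 2),                 % assumption: 2 threads/core
    erlang:system_flag(schedulers_online, Physical),  % take virtual cores offline
    {Logical, erlang:system_info(schedulers_online)}.
```

The same effect can be obtained at start-up with erl +S Logical:Online (e.g. +S 24:12 on an Athos host).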

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets, between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows. To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code, and

• most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node that we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability,

• information about messages sent and received,


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency,

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB) and that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.
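The timing logic of the modified plugin can be sketched as follows. This is a sketch only: the exact Percept2 entry points and options used by our plugin may differ, so treat percept2:profile/2 with the [all] option as an assumption.

```erlang
%% Sketch of the modified plugin's profiling window: wait 10 s after the
%% simulation starts, profile every computing node for 5 s, then stop.
-module(percept2_window).
-export([run/1]).

run(ComputingNodes) ->
    timer:sleep(10000),                               % skip simulation set-up
    [rpc:call(Node, percept2, profile,
              [atom_to_list(Node) ++ ".dat", [all]])  % one trace file per node
     || Node <- ComputingNodes],
    timer:sleep(5000),                                % 5-second profiling window
    [rpc:call(Node, percept2, stop_profile, [])
     || Node <- ComputingNodes],
    ok.
```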

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their respective licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involves terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to e.g. provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that makes it usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
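A deployment manager honouring two such entries can be sketched as follows. This is illustrative only, not Sim-Diasca's actual implementation: the option handling, its polarity, and the launch_nodes/1 placeholder are assumptions paraphrased from the description above.

```erlang
%% Illustrative sketch: honour start_nodes / use_cookies style options.
-module(deploy_options).
-export([prepare/2]).

prepare(ComputingNodes, Options) ->
    %% start_nodes: when set, assume the given computing nodes are already
    %% running (the WombatOAM case) and skip deploying them.
    case proplists:get_bool(start_nodes, Options) of
        true  -> ok;                             % nodes deployed by WombatOAM
        false -> launch_nodes(ComputingNodes)    % original Sim-Diasca behaviour
    end,
    %% use_cookies: use the shared cookie of the pre-deployed computing
    %% nodes instead of generating a random one.
    case proplists:get_value(use_cookies, Options) of
        undefined -> ok;                         % keep default cookie handling
        Cookie    -> true = erlang:set_cookie(node(), Cookie)
    end.

%% Hypothetical stand-in for Sim-Diasca's own node-launching code.
launch_nodes(_Nodes) -> ok.
```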

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology and the associated message routing between s_groups is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
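The grouping could be expressed with SD Erlang's s_group API. A minimal sketch, with illustrative group names, assuming s_group:new_s_group/2 returns {ok, Name, Nodes} on success:

```erlang
%% Sketch: a non-root time manager node joins two s_groups, one shared
%% with its parent and siblings, one containing its children.
-module(tm_sgroups).
-export([join_hierarchy/3]).

join_hierarchy(ParentGroup, SiblingNodes, ChildNodes) ->
    %% s_group of the parent and siblings (this node included).
    {ok, ParentGroup, _} =
        s_group:new_s_group(ParentGroup, [node() | SiblingNodes]),
    %% s_group containing this manager's children (illustrative name).
    ChildGroup = list_to_atom(atom_to_list(ParentGroup) ++ "_children"),
    {ok, ChildGroup, _} =
        s_group:new_s_group(ChildGroup, [node() | ChildNodes]),
    {ParentGroup, ChildGroup}.
```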


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving the knowledge about these applications and the scalability issues they experience, interpretations were made, preparing the removal of the next bottlenecks to be encountered and promoting some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution    Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

ICT-287510 (RELEASE) 23rd December 2015 57

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which is not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpi_helper module.

The MPI Helper Module    This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a node name from the base name, the node's MPI index and the hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives available data, or tells the runtime to call this function again once it believes there is data available.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Figure: execution time (s) against number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Figure: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 47: EDF Xeon machines, large executions

[Figure: execution time (s) against number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Figure: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP-17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Figure: execution time (s) against number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 50: Glasgow Xeon machines, small executions

[Figure: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Figure: execution time (s) against number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 53: Heriot-Watt AMD machine, small executions


[Figure: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1  31/01/2015  First version; submitted to internal reviewers

0.2  23/03/2015  Revised version based on comments from all internal reviewers; submitted to the Commission Services

1.0  27/03/2015  Final version; submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


and so would take longer to report its results to the master/submaster nodes. Since results have to be received from all colonies before a new iteration can proceed, this would delay the entire application.

Network traffic congestion    Another issue is that irregularities in communication times become much more apparent when there is a lot of communication occurring in the network. Experiments show that when the network is congested, communication between certain pairs of machines can take ten or more times longer than between other pairs. This effect would combine very badly with the fragmentation effects mentioned above.

We consider these issues in much greater depth in Deliverable D3.5 [REL15], where we suggest techniques which we hope will enable Erlang applications to perform well in spite of heterogeneous communication patterns.

3.3.4 Network Traffic

To investigate the impact of SD Erlang on network traffic, we measured the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figures 33(a) and 33(b) show the total numbers of sent and received packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

3.4 Summary

The results show that for both the Orbit and ACO benchmarks, the SD Erlang versions consistently scale better than the distributed Erlang ones (Sections 3.1.4 and 3.3.4). In addition, applications on Intel Xeon machines perform better running in Erlang/OTP R15B than in Erlang/OTP 17.4 (Section 3.3), whereas on AMD machines the performance is the opposite (Appendix B).

Our future plans concerning SD Erlang include the following:

• Analysing Orbit benchmark performance when there is more than one Erlang VM per host.

• Investigating the dependency between the number of nodes, the number of s_groups and the number of gateway processes.


(a) Number of Sent Packets

(b) Number of Received Packets

Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts, we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases, respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability, we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores on a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that the default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster, during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at this scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale, up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and in the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
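To make the mechanism concrete, a plugin of this kind can be sketched as follows. This is not the actual Sim-Diasca plugin code: the callback names and the Percept2 option list are assumptions for illustration, and only `rpc:call/4` and the `percept2:profile`/`percept2:stop_profile` entry points reflect the real APIs involved.

```erlang
%% Sketch only: the plugin callback names below are hypothetical, and the
%% Percept2 option list is illustrative; the real hooks are defined by
%% Sim-Diasca's plugin API.
-module(percept2_plugin_sketch).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Called when the simulation starts: begin profiling on every computing
%% node, writing one trace file per node for later offline analysis.
on_simulation_start(ComputingNodes) ->
    lists:foreach(
        fun(Node) ->
            File = "percept2_" ++ atom_to_list(Node) ++ ".dat",
            rpc:call(Node, percept2, profile, [File, [all]])
        end,
        ComputingNodes).

%% Called when the simulation ends: stop profiling on all computing nodes,
%% leaving one trace file per node behind.
on_simulation_stop(ComputingNodes) ->
    lists:foreach(
        fun(Node) -> rpc:call(Node, percept2, stop_profile, []) end,
        ComputingNodes).
```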

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability,

• information about messages sent and received,


Figure 4.1: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief' ((a) execution time, (b) speedup).


Figure 4.2: BenchErl results running the 'small' scale of City-simulation with duration 'brief' ((a) execution time, (b) speedup).


• information about scheduler concurrency, and

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, then starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 4.3.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 4.3: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license: i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency: Usually the user wants to execute several simulations after each other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing: Our load testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
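For illustration only, the intent of these two options can be sketched as a configuration fragment. The concrete syntax and placement of Sim-Diasca simulation settings differ, and the node names and cookie below are invented:

```erlang
%% Illustrative fragment only: real Sim-Diasca settings use their own
%% option syntax; names and values here are invented.
SimulationSettings = [
    %% start_nodes: assume these computing nodes are already running,
    %% so the deployment manager must neither deploy nor shut them down.
    {start_nodes, ['comp1@10.0.0.1', 'comp2@10.0.0.2']},
    %% use_cookies: use this fixed cookie (already set on all computing
    %% nodes) instead of generating a random one on the user node.
    {use_cookies, wombat_cookie}
].
```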

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.
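The naming convention above can be sketched with a small helper of our own (not part of Sim-Diasca or WombatOAM; the exact separators used by the real engine may differ):

```erlang
%% Illustrative helper reproducing the naming convention described above;
%% not part of either tool, and the real separators may differ.
-module(node_name_sketch).
-export([computing_node_name/3]).

%% computing_node_name(soda_benchmarking_test, "myuser", "10.0.0.1")
%% builds 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.
computing_node_name(Simulation, User, Host) ->
    Words = string:tokens(atom_to_list(Simulation), "_"),
    Camel = string:join([capitalize(W) || W <- Words], "_"),
    list_to_atom("Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host).

%% Upper-case the first character of a word: "soda" -> "Soda".
capitalize([First | Rest]) -> [string:to_upper(First) | Rest].
```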

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the compute nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 4.4, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 4.5 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
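In SD Erlang terms, the proposed grouping could be set up roughly as follows, using s_group:new_s_group/2 from the SD Erlang API. The design is unimplemented, and all node and group names below are invented for illustration:

```erlang
%% Hypothetical sketch of the proposed (unimplemented) design.
%% The root time manager and its direct children share one s_group:
{ok, _, _} =
    s_group:new_s_group(root_tm_group,
                        ['root_tm@host0', 'tm1@host1', 'tm2@host2']),

%% A non-root time manager also forms an s_group with its own children,
%% so each time manager belongs to (at most) two s_groups:
{ok, _, _} =
    s_group:new_s_group(tm1_children_group,
                        ['tm1@host1', 'tm1a@host3', 'tm1b@host4']).
```

With this topology, nodes in different s_groups share neither full-mesh connections nor name spaces; messages crossing group boundaries would be relayed by the gateway processes mentioned above.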


Figure 4.4: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 4.5: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

By improving the knowledge about these applications and the scalability issues they experience, interpretations were made, preparing the removal of the next bottlenecks to be encountered, and promoting some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (these are the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: these are the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address of their associated I/O node; thus the port can not be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPIindex ++ "@" ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
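The name construction performed by startup can be sketched as follows (our illustration; the real mpihelper module additionally performs the pairwise message exchange described above, and the helper function name here is an assumption):

```erlang
%% Illustrative fragment: build the node name
%% basename ++ MPIindex ++ "@" ++ hostname (e.g. 'mpinode3@cn17')
%% and pass it to net_kernel, as mpihelper:startup does.
start_distribution(BaseName, MpiIndex) ->
    {ok, HostName} = inet:gethostname(),
    NodeName = list_to_atom(
                 BaseName ++ integer_to_list(MpiIndex) ++ "@" ++ HostName),
    net_kernel:start([NodeName, shortnames]).
```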

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 4.6: EDF Xeon machines, small executions: execution time (s) vs. number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded with two processing units per core

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
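The two sweeps above can be sketched as follows. This is an illustrative harness only: `run_once` is a hypothetical stand-in for a single timed ACO run, not part of the project's actual benchmarking scripts.

```python
from statistics import mean

# Ant-count sweeps used in the experiments:
# small: 1, 10, 20, 30, ..., 1000; large: 1, 500, 1000, 1500, ..., 100000.
small_counts = [1] + list(range(10, 1001, 10))
large_counts = [1] + list(range(500, 100001, 500))

def mean_runtime(run_once, n_ants, repeats=5):
    """Mean execution time over `repeats` runs, as plotted in the figures."""
    return mean(run_once(n_ants) for _ in range(repeats))
```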

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions (execution time in seconds against number of ants, 1, 500, 1000, 1500, ..., 100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set (execution time in seconds against number of ants, 1, 10, 20, 30, ..., 1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set (execution time in seconds against number of ants, 1, 500, 1000, 1500, ..., 100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.
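For clarity, the percentage comparisons used throughout this appendix are computed as relative differences in mean execution time. The runtimes below are made-up illustrative values, not measurements; only the 12.6% and 0.7% figures come from the text.

```python
# How the quoted percentages are computed from mean execution times.
def pct_longer(t, t_ref):
    """How much longer t is than t_ref, as a percentage."""
    return 100.0 * (t - t_ref) / t_ref

t_r15b = 0.50                        # hypothetical mean runtime (s) on R15B
t_174 = t_r15b * 1.126               # OTP 17.4, ~12.6% slower
t_174_muacul0 = t_174 * (1 - 0.007)  # +Muacul0 recovers only ~0.7%

print(round(pct_longer(t_174, t_r15b), 1))           # 12.6
print(round(pct_longer(t_174_muacul0, t_r15b), 1))   # 11.8
```

The second figure shows why the flag cannot account for the gap: a 0.7% improvement barely dents a 12.6% slowdown.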

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


Figure 50: Glasgow Xeon machines, small executions (execution time in seconds against number of ants, 1, 10, 20, 30, ..., 1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).

Figure 51: Glasgow Xeon machines, large executions (execution time in seconds against number of ants, 1, 500, 1000, 1500, ..., 100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.
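The percentages quoted above can be cross-checked by composing the ratios. This is a sanity check on the reported numbers, not additional data; the small difference from the quoted 15% reflects rounding of the inputs.

```python
# If the RELEASE build is ~8% slower than R15B, and R15B is ~6% slower than
# official OTP 17.4, then the RELEASE build should be roughly
# (1.08 * 1.06 - 1) slower than official 17.4.
release_vs_r15b = 1.08    # RELEASE ~8% longer than R15B
r15b_vs_official = 1.06   # R15B ~6% longer than official OTP 17.4

release_vs_official = release_vs_r15b * r15b_vs_official
print(round(100 * (release_vs_official - 1)))  # 14
```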

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions (execution time in seconds against number of ants, 1, 10, 20, 30, ..., 1000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).


Figure 54: Heriot-Watt AMD machine, large executions (execution time in seconds against number of ants, 1, 500, 1000, 1500, ..., 100000, for R15B, R16B03-1, OTP 17.0, OTP 17.4 official and OTP 17.4 RELEASE).

Change Log

Version  Date        Comments

0.1      31/01/2015  First version, submitted to internal reviewers

0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 33: Network Traffic in ML-ACO, GR-ACO and SR-ACO. (a) Number of sent packets; (b) number of received packets.


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters, and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new small scale of the City-example case, i.e. the second version of the "small" scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this falls to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
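The speedup and efficiency figures above are computed as follows. The runtimes used here are approximate values read off Figure 34 and should be treated as illustrative assumptions rather than the measured data.

```python
# Relative speedup on n nodes is T(1 node) / T(n nodes); parallel efficiency
# divides that by n. Runtimes (minutes) are approximate readings of Figure 34.
runtime_min = {1: 1000, 2: 667, 4: 455, 16: 290}   # nodes -> minutes

def speedup(n):
    return runtime_min[1] / runtime_min[n]

def efficiency(n):
    return speedup(n) / n

print(round(speedup(2), 2), round(speedup(16), 2))        # 1.5 3.45
print(round(efficiency(2), 2), round(efficiency(16), 2))  # 0.75 0.22
```

The falling efficiency (0.75 on 2 nodes, 0.22 on 16) is the quantitative form of the under-utilisation noted above.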

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of 32 logical available cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered as a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.
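As a quick sanity check on those top readings (the 2200%, 32-logical-core and 14%/64GB figures come from the text; the conversion to absolute values is ours):

```python
# top reports core usage as a percentage per logical core (100% each),
# and memory usage as a percentage of the host's RAM.
logical_cores = 32
ram_gb = 64.0

core_usage_pct = 2200           # as shown by top
cores_busy = core_usage_pct / 100
mem_pct = 14
mem_gb = ram_gb * mem_pct / 100

print(cores_busy, mem_gb)       # 22.0 8.96
```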

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster, during the case study. The number of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows. To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, and the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. The reason we selected this particular setup was that we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


Figure 41: BenchErl results running the 'tiny' scale of the City simulation with duration 'brief'. (a) Execution time; (b) speedup.


Figure 42: BenchErl results running the 'small' scale of the City simulation with duration 'brief'. (a) Execution time; (b) speedup.


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
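The timed profiling window the plugin implements (wait 10 seconds, profile for 5 seconds, stop) can be sketched as follows. This is a Python analogue for illustration only; `start_profiling` and `stop_profiling` are hypothetical stand-ins for whatever starts and stops Percept2 on the computing nodes.

```python
import threading

def timed_profiling_window(start_profiling, stop_profiling,
                           delay_s=10, duration_s=5):
    """After delay_s seconds, start the profiler; stop it duration_s later."""
    def _run():
        start_profiling()
        threading.Timer(duration_s, stop_profiling).start()
    threading.Timer(delay_s, _run).start()
```

The point of the delay is the same as in the plugin: skip the setup phase so that only steady-state simulation activity is recorded, and bound the window so the trace stays small enough to analyse.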

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licenses, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct version of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM, or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes, and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a

ICT-287510 (RELEASE) 23rd December 2015 51

simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involves terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after it, resources are wasted. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used, e.g., to provision a set of instances for computing nodes, execute a number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, so that they do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
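As a rough illustration, the two entries could be given along the following lines. This is a sketch only: the option names come from the description above, but the surrounding configuration format and the example values are assumptions, not verified Sim-Diasca syntax.

%% Sketch of a Sim-Diasca configuration fragment (format assumed):
[
  %% The computing nodes passed in are already running; skip deployment.
  {start_nodes, false},

  %% Use this fixed cookie (shared by all computing nodes) instead of
  %% generating a random one on the user node.
  {use_cookies, 'simdiasca-shared-cookie'}
].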

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible, which is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, the node should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming scheme: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.
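The naming rule illustrated above could be computed along the following lines. This is an illustrative sketch only: the module and function names are invented, and the exact formatting Sim-Diasca expects (capitalisation, separators) is inferred from the single example in the text rather than from its verified behaviour.

%% Hypothetical helper mirroring the example name
%% Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1 given above.
-module(tm_naming).
-export([computing_node_name/3]).

computing_node_name(SimName, User, Host) ->
    %% "soda_benchmarking_test" -> "Soda_Benchmarking_Test"
    Camel = string:join([capitalise(W) || W <- string:tokens(SimName, "_")],
                        "_"),
    "Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host.

capitalise([First | Rest]) ->
    [string:to_upper(First) | Rest].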

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
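The grouping just described could be set up roughly as follows. This is a sketch of the proposed (and, as noted above, unimplemented) design: the node names are invented for illustration, and we assume the standard SD Erlang s_group:new_s_group/2 call.

%% Hypothetical sketch: each time manager joins the s_group of its parent
%% and siblings, and an s_group containing its children.
setup_time_manager_groups() ->
    %% The root time manager and its immediate children share one s_group.
    {ok, _, _} = s_group:new_s_group(tm_level0,
                                     ['root_tm@h0', 'tm_a@h1', 'tm_b@h2']),
    %% Each non-root time manager in turn forms an s_group with its own
    %% children, so 'tm_a@h1' belongs to two s_groups, as described above.
    {ok, _, _} = s_group:new_s_group(tm_level1_a,
                                     ['tm_a@h1', 'tm_c@h3', 'tm_d@h4']),
    ok.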


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

The improved understanding of these applications, and of the scalability issues they experience, prepares for the removal of the next bottlenecks to be encountered, and promotes design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, one instead needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which is not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI index ++ "@" ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally,

mpihelper:get_world_size() returns the number of nodes in total, and

mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
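A typical use of these helpers could look as follows. The function names come from the text above, but the argument and return shapes are assumptions, so this is a usage sketch rather than the module's verified API.

%% Hypothetical usage sketch of the mpihelper module described above.
start_and_report() ->
    _ = mpihelper:startup(),             %% base name defaults to "mpinode"
    Size  = mpihelper:get_world_size(),  %% total number of Erlang nodes
    Index = mpihelper:get_index(),       %% this node's MPI index
    Peers = mpihelper:nodes(),           %% all other nodes
    io:format("node ~p of ~p, peers: ~p~n", [Index, Size, Peers]).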

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
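On the Erlang side, a module selected via -proto_dist is expected to export the conventional distribution callbacks. A skeleton of that shape might look as follows; the callback names follow OTP's usual inet_tcp_dist pattern, and the delegation to mpi_server is an assumption for illustration, not the project's actual mpi_dist code.

%% Sketch of a distribution module as selected by "-proto_dist mpi":
%% OTP resolves the name "mpi" to a module called mpi_dist.
-module(mpi_dist).
-export([listen/1, accept/1, accept_connection/5,
         setup/5, close/1, select/1]).

listen(Name)  -> mpi_server:listen(Name).
accept(L)     -> mpi_server:accept(L).
accept_connection(AccPid, Socket, MyNode, Allowed, SetupTime) ->
    mpi_server:accept_connection(AccPid, Socket, MyNode, Allowed, SetupTime).
setup(Node, Type, MyNode, LongOrShort, SetupTime) ->
    mpi_server:setup(Node, Type, MyNode, LongOrShort, SetupTime).
close(Socket) -> mpi_server:close(Socket).
select(Node)  -> mpi_server:select(Node).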

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP^10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

^10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases.
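The measurement loop just described can be sketched as follows. The benchmark entry point ant_colony:run/2 is an assumed name for illustration, not the benchmark's verified API; only the repetition-and-mean scheme comes from the text.

%% Hypothetical sketch: run the ACO benchmark Repeats times for a given
%% number of ants (with 50 generations, as in the experiments) and return
%% the mean wall-clock time in seconds.
mean_runtime(NumAnts, Repeats) ->
    Times = [element(1, timer:tc(ant_colony, run, [NumAnts, 50]))
             || _ <- lists:seq(1, Repeats)],
    (lists:sum(Times) / Repeats) / 1.0e6.   %% timer:tc reports microseconds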


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

The versions used were R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version | Date       | Comments
0.1     | 31/01/2015 | First version, submitted to internal reviewers
0.2     | 23/03/2015 | Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0     | 27/03/2015 | Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis Nikolaos Papaspyrou Katerina Roukounaki Konstantinos Sagonas Yian-nis Tsiouris and Ioannis E Venetis A scalability benchmark suite for ErlangOTP InProceedings of the Eleventh ACM SIGPLAN Workshop on Erlang pages 33ndash42 ACM 2012

[Bas14] Basho Riak 2014

[BBHS99] A Bauer B Bullnheimer RF Hartl and C Strauss An ant colony optimization approachfor the single machine total tardiness problem In Evolutionary Computation 1999 CEC99 Proceedings of the 1999 Congress on volume 2 1999

[CLTG14] N Chechina H Li P Trinder and A Ghaffari Scalable SD Erlang computation modelTechnical Report TR-2014-003 The University of Glasgow December 2014

[dBSD00] Matthijs den Besten Thomas Stutzle and Marco Dorigo Ant colony optimization for thetotal weighted tardiness problem In Marc Schoenauer Kalyanmoy Deb Gunther RudolphXin Yao Evelyne Lutton JuanJulian Merelo and Hans-Paul Schwefel editors ParallelProblem Solving from Nature PPSN VI volume 1917 of Lecture Notes in Computer Sciencepages 611ndash620 Springer Berlin Heidelberg 2000

ICT-287510 (RELEASE) 23rd December 2015 66

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(3):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
  • Benchmarks
    • Orbit
      • Running Orbit on Athos
      • Distributed Erlang Orbit
      • SD Erlang Orbit
      • Experimental Evaluation
      • Results on Other Architectures
    • Ant Colony Optimisation (ACO)
      • ACO and SMTWTP
      • Multi-colony approaches
      • Evaluating Scalability
      • Experimental Evaluation
        • Performance comparison of different ACO and Erlang versions on the Athos cluster
        • Basic results
        • Increasing the number of messages
        • Some problematic results
        • Network Traffic
      • Summary
  • Measurements
    • Distributed Scalability
      • Performance
      • Distributed Performance Analysis
      • Discussion
    • BenchErl
    • Percept2
  • Experiments
    • Deploying Sim-Diasca with WombatOAM
      • The design of the implemented solution
      • Deployment steps
    • SD Erlang Integration
  • Implications and Future Work
  • Porting Erlang/OTP to the Blue Gene/Q
    • Basing Erlang/OTP's Distribution Mechanism on MPI
    • MPI Driver Internals
    • Current Status of the Blue Gene/Q Port
  • Single-machine ACO performance on various architectures and Erlang/OTP releases
    • Experimental parameters
    • Discussion of results
      • EDF Xeon machines
      • Glasgow Xeon machines
      • AMD machines
    • Discussion


Figure 34: Runtimes of the Sim-Diasca City Case Study, GPG Cluster

4 Measurements

4.1 Distributed Scalability

This section investigates the distributed performance of a Sim-Diasca instance executing the City example. As the system scales over multiple distributed hosts we measure the parallel runtimes and speedups, revealing some performance issues. We investigate the impact of tuning VM parameters and use generic tools to report core, memory and network usage; Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City instance is measured on the GPG and Athos clusters, specified in Section 3.1.5 and [GPG15]. The GPG cluster at Glasgow University consists of 20 hosts, where each host has 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM. We use Erlang/OTP 17.4 in all our measurements. To investigate resource consumption we employ standard Linux tools such as top and netstat.

The specific Sim-Diasca instance is the new 'small' scale of the City-example case, i.e. the second version of the 'small' scale. A new version was needed for that scale in order to boost the overall concurrency level, by introducing models that rely on a numerical solver, which is more computationally intensive. The City case study has two phases, i.e. initialisation and execution. The initialisation phase is excluded from our measurement to reduce the runtime and increase parallelism. We used the following commands for the initialisation and execution phases respectively:

• make generate CASE_SCALE=small EXECUTION_TARGET=production

• make run CASE_SCALE=small CASE_DURATION=short EXECUTION_TARGET=production

4.1.1 Performance

To measure scalability we compare the runtime of the Sim-Diasca City case study at different GPG cluster sizes, i.e. 1, 2, 4, 8, 12 and 16 nodes, and hence with 16, 32, 64, 128, 192 and 256 cores. Figure 34 reports the runtimes of the Sim-Diasca City case study on up to 16 nodes (256 cores). We see from the figure that the case study takes around 1,000 minutes on a single node, whereas it falls below 300 minutes on 16 nodes. For pragmatic reasons we have not measured the runtime on just one of the 16 cores of a GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours or nearly three days.

Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this only rises to 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.
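The relative speedup and efficiency figures quoted above follow directly from the measured runtimes. A minimal sketch of the calculation (the runtimes are illustrative values approximating Figure 34, not the exact measurements):

```python
# Relative speedup and efficiency against the single-node (16-core) baseline.
# Runtimes (minutes) are illustrative values approximating Figure 34.
runtimes = {1: 1000, 2: 667, 4: 455, 8: 345, 12: 310, 16: 290}

def relative_speedup(runtimes, baseline_nodes=1):
    base = runtimes[baseline_nodes]
    return {n: base / t for n, t in runtimes.items()}

def efficiency(speedups):
    # Fraction of ideal linear speedup achieved at each node count.
    return {n: s / n for n, s in speedups.items()}

s = relative_speedup(runtimes)
e = efficiency(s)
print(f"2 nodes: speedup {s[2]:.2f}, efficiency {e[2]:.0%}")
print(f"16 nodes: speedup {s[16]:.2f}, efficiency {e[16]:.0%}")
```

With these illustrative values the sketch reproduces the quoted speedups of roughly 1.5 on 2 nodes and 3.45 on 16 nodes, and makes the declining parallel efficiency explicit.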

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode schedulers will be spread over hardware threads across NUMA nodes, but schedulers will only be spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core and memory usage. The Linux top command is used to investigate the core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB) respectively for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster, during the case study. The number of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang's s groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code; and

• most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so that it would not capture information unrelated to the actual simulation, such as its setup) nor too late (so that it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file per computing node that we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration using only one computing node. The reason we selected this particular setup was that we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability

• information about messages sent and received


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a much smaller file (approximately 85MB) that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.
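The windowing logic of the modified plugin can be sketched as follows. This is an illustrative Python sketch of the idea only (the real plugin is Erlang code driving Percept2); `start_profiler` and `stop_profiler` are hypothetical stand-ins for starting and stopping Percept2 on the computing nodes, and the 10-second delay and 5-second window are the values quoted above:

```python
import threading
import time

def profile_window(start_profiler, stop_profiler, delay=10.0, duration=5.0):
    """Start profiling `delay` seconds after the simulation begins, and
    stop it `duration` seconds later, so that only a short, analysable
    slice of the execution is traced."""
    def run():
        time.sleep(delay)
        start_profiler()
        time.sleep(duration)
        stop_profiler()
    t = threading.Thread(target=run)
    t.start()
    return t

# Demonstration with stub hooks and tiny delays instead of 10s/5s:
events = []
t = profile_window(lambda: events.append("start"),
                   lambda: events.append("stop"),
                   delay=0.01, duration=0.01)
t.join()
```

The trade-off is the one described above: a shorter window keeps the trace small enough to analyse, at the cost of observing only a slice of the simulation.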

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under the LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API, while WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after it, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load-testing tool, Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimisation worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
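The naming convention described above can be sketched programmatically. This is an illustrative Python sketch, not the actual Sim-Diasca code; the exact separator characters are an assumption derived from the example quoted above:

```python
def computing_node_name(simulation, user, host):
    # Capitalise each word of the simulation name,
    # e.g. "soda_benchmarking_test" becomes "Soda_Benchmarking_Test".
    # Assumed convention: Sim-Diasca_<SimulationName>-<user>@<host>.
    camel = "_".join(w.capitalize() for w in simulation.split("_"))
    return f"Sim-Diasca_{camel}-{user}@{host}"

print(computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1"))
```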

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ['10.0.0.1', '10.0.0.2', '10.0.0.3']).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
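The connectivity saving of such a grouping can be illustrated with a small back-of-the-envelope model. The sketch below (Python) is purely illustrative: the grouping function, group size, and gateway scheme are simplifying assumptions, not Sim-Diasca's actual design.

```python
def mesh_connections(n):
    """Fully connected distributed Erlang: every node connects to every other."""
    return n * (n - 1) // 2

def grouped_connections(n, g):
    """Nodes partitioned into s_groups of size g: nodes are fully connected
    only inside their own group, and one gateway per group joins a root
    s_group that is itself fully connected."""
    groups = -(-n // g)                   # ceil(n / g)
    intra = groups * (g * (g - 1) // 2)   # connections inside each group
    gateway = groups * (groups - 1) // 2  # connections among the gateways
    return intra + gateway

print(mesh_connections(100))         # 4950 connections in a full mesh
print(grouped_connections(100, 10))  # 495 with 10 groups of 10
```

Even this crude model shows an order-of-magnitude reduction in connections, which is the effect the hierarchical s_group design aims for.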

ICT-287510 (RELEASE) 23rd December 2015 54

Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we were able to interpret the results, prepare for the removal of the next bottlenecks to be encountered, and promote some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialised for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialised nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of the Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
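The shape of this workaround can be sketched with ordinary POSIX non-blocking I/O (a Python illustration only; the actual port implements this in C inside the Erlang runtime, and CNK's behaviour differs from the Linux semantics used here):

```python
import fcntl
import os
import socket

def set_nonblocking(fd):
    """Put a file descriptor into non-blocking mode."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

def spin_read(fd, count):
    """Retry a non-blocking read until data arrives, instead of issuing a
    blocking read() (which, per the text, can lead to deadlock under CNK)."""
    while True:
        try:
            return os.read(fd, count)
        except BlockingIOError:
            pass  # spin; the real port busy-waits on EAGAIN the same way

# The port also replaced pipe() with manually connected socket pairs:
a, b = socket.socketpair()
set_nonblocking(b.fileno())
a.sendall(b"ping")
print(spin_read(b.fileno(), 4))  # b'ping'
```

The same pattern applies to writes: the call is retried until the kernel accepts the data, so the emulator thread never parks in a blocking system call.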

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.
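The name construction described above can be modelled as follows (a Python sketch; the "@" separator between index and hostname is an assumption based on the usual name@host convention for Erlang nodes, since the text only lists the concatenated parts):

```python
def node_name(mpi_index, hostname, basename="mpinode"):
    """Build a node name the way mpi_helper:startup is described to:
    basename ++ MPI index ++ hostname, joined here with '@' following
    Erlang's name@host node-naming convention (an assumption)."""
    return f"{basename}{mpi_index}@{hostname}"

# Each MPI rank can derive its own node name without consulting epmd:
print([node_name(rank, "cn01") for rank in range(3)])
# ['mpinode0@cn01', 'mpinode1@cn01', 'mpinode2@cn01']
```

The point of the scheme is that the MPI rank makes every node's name deterministic, which is exactly what epmd would otherwise have been needed for.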

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.
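The command-byte dispatch that output performs can be modelled as follows (a Python sketch; the byte values are invented for illustration, as the text does not give the driver's actual encoding):

```python
# Hypothetical command-byte values (the real encoding is not given in the text)
LISTEN, ACCEPT, CONNECT, SEND, RECEIVE = range(5)

def handle_output(buf, handlers):
    """Model of the driver's output() entry point: the first byte selects
    the command, and the remainder is relayed to the matching handler."""
    command, payload = buf[0], buf[1:]
    return handlers[command](payload)

handlers = {
    SEND:    lambda payload: ("send", payload),
    CONNECT: lambda payload: ("connect", payload),
}
print(handle_output(bytes([SEND]) + b"hello", handlers))  # ('send', b'hello')
```

This single-byte-tag framing is what lets one output entry point multiplex all five commands over the same port.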

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.
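The aggregation step of this protocol (five runs per ant count, plotting the mean) amounts to no more than the following (a Python sketch; the timing values are made up for illustration):

```python
from statistics import mean

# Hypothetical raw timings in seconds: five runs for each number of ants
runs = {
    10: [0.21, 0.20, 0.22, 0.21, 0.21],
    20: [0.26, 0.25, 0.25, 0.24, 0.25],
}

# One point per ant count: the mean over its five runs
mean_times = {ants: round(mean(times), 3) for ants, times in runs.items()}
print(mean_times)  # {10: 0.21, 20: 0.25}
```

Each curve in Figures 46-54 is a sequence of such per-ant-count means, one curve per VM release.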

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version | Date       | Comments
0.1     | 31/01/2015 | First version, submitted to internal reviewers
0.2     | 23/03/2015 | Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0     | 27/03/2015 | Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611-620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45-62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73-74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762-774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371-379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207-221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287-296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305-320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181-5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346-354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986-996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

Contents

• Executive Summary
• The main case study
  • Sim-Diasca Overview
  • City Example
    • Overview of the simulation case
    • Description of the simulated elements
    • Additional changes done for benchmarking
• Benchmarks
  • Orbit
    • Running Orbit on Athos
    • Distributed Erlang Orbit
    • SD Erlang Orbit
    • Experimental Evaluation
    • Results on Other Architectures
  • Ant Colony Optimisation (ACO)
    • ACO and SMTWTP
    • Multi-colony approaches
    • Evaluating Scalability
  • Experimental Evaluation
    • Performance comparison of different ACO and Erlang versions on the Athos cluster
      • Basic results
      • Increasing the number of messages
      • Some problematic results
      • Network Traffic
    • Summary
• Measurements
  • Distributed Scalability
    • Performance
    • Distributed Performance Analysis
    • Discussion
  • BenchErl
  • Percept2
• Experiments
  • Deploying Sim-Diasca with WombatOAM
    • The design of the implemented solution
    • Deployment steps
  • SD Erlang Integration
• Implications and Future Work
• Porting Erlang/OTP to the Blue Gene/Q
  • Basing Erlang/OTP's Distribution Mechanism on MPI
  • MPI Driver Internals
  • Current Status of the Blue Gene/Q Port
• Single-machine ACO performance on various architectures and Erlang/OTP releases
  • Experimental parameters
  • Discussion of results
    • EDF Xeon machines
    • Glasgow Xeon machines
    • AMD machines
  • Discussion


Figure 35: Speedup of the Sim-Diasca City Case Study, GPG Cluster

GPG node, but the single-core measurements in Section 4.2 suggest that the maximum speedup would be a factor of 4, giving a runtime of around 4,000 minutes, i.e. 67 hours, or nearly three days.

Figure 35 shows the corresponding speedup relative to a single 16-core node. It shows a maximum relative speedup of 3.45. While the runtime of this Sim-Diasca instance continues to fall up to 16 nodes, it does not efficiently utilise the distributed hardware. For example, the speedup figure shows that while we get a respectable speedup of 1.5 on 2 nodes (32 cores), this reaches only 2.2 on 4 nodes (64 cores), and thereafter degrades to a maximum of just 3.45 on 16 nodes (256 cores).
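Relative to a single 16-core node, parallel efficiency is simply speedup divided by node count. A quick check (Python; the speedup values are those quoted in the text) makes the degradation explicit:

```python
def efficiency(speedup, nodes):
    """Parallel efficiency relative to a single node (here, one 16-core host)."""
    return speedup / nodes

# Speedups of the Sim-Diasca City instance quoted in the text
reported = {2: 1.5, 4: 2.2, 16: 3.45}
for nodes, speedup in sorted(reported.items()):
    print(f"{nodes:2d} nodes: speedup {speedup:.2f}, "
          f"efficiency {efficiency(speedup, nodes):.2f}")
# efficiency falls from 0.75 on 2 nodes to about 0.22 on 16 nodes
```

Monitoring this ratio, rather than raw runtime, is what reveals how poorly the added hardware is being utilised.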

While ideal speedups are not expected for such complex simulations, it seems that there are distributed performance issues that profiling could identify and help us to improve.

4.1.2 Distributed Performance Analysis

Experiments showed that some runtime flags controlling the VM scheduling had an impact on the performance of a given computing host. First, using the +sbt tnnps flag, i.e. relying on the thread_no_node_processor_spread policy, proved to be effective. In this mode, schedulers are spread over hardware threads across NUMA nodes, but schedulers are only spread over processors internally in one NUMA node at a time.

Discovering the optimal number of online schedulers required some extra measurements. Figure 36 shows the impact of tuning this parameter on runtimes on the Athos cluster. It shows that not taking into account the virtual cores provided by hyperthreading seems to be the best option here. That is, using the 12 actual cores shows better performance than using the 24 (virtual) ones that default settings would select.

We used conventional tools to investigate the impact of scaling the Sim-Diasca City instance over multiple hosts on network, core, and memory usage. The Linux top command was used to investigate core and memory usage, as shown in Figures 37 and 38. The maximum core and memory usage are 2200% (22 out of the 32 available logical cores) and 14% (8.96 GB), respectively, for a single node. The memory usage on a single (GPG) host can be considered a bottleneck that could cause problems for running larger instances on a single GPG host. As the cluster grows, both core and memory usage decline, as expected in a distributed system.

Figure 39 shows the network traffic, i.e. the number of sent and received packets between nodes in the cluster during the case study. The numbers of sent and received packets are roughly the same, and


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as the cluster size grows.

To complement the measurements made on the Glasgow GPG cluster, similar studies were performed

on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE. BenchErl was used with Sim-Diasca in order

ICT-287510 (RELEASE) 23rd December 2015 45

to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code, and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

We therefore moved all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevented Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.
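The shape of the resulting integration can be sketched as a BenchErl benchmark module. BenchErl benchmark modules export bench_args/2 (the argument combinations to measure) and run/3 (one benchmark execution, given the slave nodes that BenchErl has launched); the Sim-Diasca-specific body below, including the simdiasca:run_simulation entry point, is purely illustrative.

```erlang
%% Illustrative sketch of a BenchErl benchmark module wrapping Sim-Diasca.
%% Node launching is done by BenchErl, not by Sim-Diasca (see the text above).
-module(simdiasca_bench).
-export([bench_args/2, run/3]).

%% The argument combinations to benchmark: here one City-example
%% configuration, 'tiny' scale with 'brief' duration.
bench_args(_Version, _Conf) ->
    [[tiny, brief]].

%% Run one instance; Slaves are the computing nodes already launched by
%% BenchErl, which Sim-Diasca must use as-is (hypothetical entry point).
run([Scale, Duration], Slaves, _Conf) ->
    ok = simdiasca:run_simulation(Scale, Duration, Slaves).
```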

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores of the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or different chips, at which point NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so that it would not capture information unrelated to the actual simulation, such as its setup) nor too late (so that it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file per computing node, which we can later analyse and visualise using Percept2.

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we expected that any implementation problems in Sim-Diasca that Percept2 could detect would show up even in this simple case. Running a larger instance of Sim-Diasca under Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
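The timed profiling window described above can be sketched as follows. The plugin callback name and the exact shape of Percept2's percept2:profile/2 arguments are assumptions; only the 10-second delay, the 5-second window, and the one-trace-file-per-node outcome come from the text.

```erlang
%% Sketch of the modified plugin: profile a 5-second window, starting
%% 10 seconds into the simulation (callback name is hypothetical).
-module(percept2_window_plugin).
-export([on_simulation_start/1]).

-define(START_DELAY_MS, 10000). %% let the simulation settle for 10 seconds
-define(PROFILE_MS,      5000). %% then profile for 5 seconds

on_simulation_start(ComputingNodes) ->
    spawn(fun() ->
        timer:sleep(?START_DELAY_MS),
        %% start Percept2 on every computing node ...
        [rpc:call(N, percept2, profile,
                  ["percept2_" ++ atom_to_list(N) ++ ".dat", [all]])
         || N <- ComputingNodes],
        timer:sleep(?PROFILE_MS),
        %% ... and stop it again, leaving one trace file per node
        [rpc:call(N, percept2, stop_profile, []) || N <- ComputingNodes]
    end).
```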

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications to each tool follow that tool's general licence: modifications to Sim-Diasca are under the LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager to deploy its nodes, in the following way. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in Sim-Diasca, WombatOAM, or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options to resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes): Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. Provisioning virtual machine instances before each simulation and terminating them after each simulation is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
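The two configuration entries could take a shape such as the following. The concrete Sim-Diasca configuration syntax is not shown in this document, so a simple Erlang property list is assumed here; only the option names come from the text, and the cookie value is illustrative.

```erlang
%% Illustrative shape of the two new configuration entries (assumed
%% proplist syntax; option names from the text, values hypothetical).
[{start_nodes, false},         %% the computing nodes passed in are already
                               %% running; do not deploy or launch them
 {use_cookies, wombat_cookie}  %% use this cookie (shared by all computing
                               %% nodes) instead of generating random ones
].
```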

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying via a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 by the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

More specifically, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
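Building such a hierarchy could be sketched with SD Erlang's s_group:new_s_group/2, which creates a named s_group over a list of nodes. The group names and node layout below are illustrative; the real design would derive them from the time manager tree.

```erlang
%% Sketch of creating the two-level s_group hierarchy described above,
%% assuming SD Erlang's s_group:new_s_group/2 (group names hypothetical).
-module(tm_sgroups).
-export([create_hierarchy/2]).

%% RootNode hosts the root time manager; ChildNodes host the local time
%% managers that are its direct children in the tree.
create_hierarchy(RootNode, ChildNodes) ->
    %% One s_group connects the root with its children and their siblings.
    {ok, _Name, _Nodes} =
        s_group:new_s_group(root_tm_group, [RootNode | ChildNodes]),
    %% For a deeper tree, each child would in turn create an s_group
    %% containing its own children, e.g.:
    %%   s_group:new_s_group({tm_group, Child}, [Child | GrandChildren])
    ok.
```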


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we prepared the removal of the next bottlenecks to be encountered, and identified some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from IBM's Blue Gene series of computer architectures. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and for certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can access the TCP/IP stack at a time, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's TCP/IP-based distribution mechanism, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which is not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name of the form basename ++ MPI_index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
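Typical use of the helper module might then look as follows. The function names are those listed above; the return values, and the wrapping function, are assumptions for illustration.

```erlang
%% Sketch of initializing MPI-based distribution via mpi_helper (function
%% names from the text above; return-value shapes are assumed).
init_distribution() ->
    mpi_helper:startup(),                %% node name built from "mpinode",
                                         %% the MPI index and the hostname
    Peers = mpi_helper:nodes(),          %% every other Erlang node in the MPI job
    World = mpi_helper:get_world_size(), %% total number of nodes
    Rank  = mpi_helper:get_index(),      %% this node's unique MPI index
    io:format("node ~p of ~p; peers: ~p~n", [Rank, World, Peers]).
```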

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.

ICT-287510 (RELEASE) 23rd December 2015 59


Figure 46: EDF Xeon machines, small executions (execution time in seconds against number of ants, for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE))

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases.
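The timing procedure can be sketched as follows. The harness below is illustrative only: the ACO entry point is passed in as a fun, since the actual benchmark module names are not reproduced here.

```erlang
%% Sketch of the timing harness described above: run a workload several
%% times for a given number of ants and report the mean wall-clock time.
-module(aco_bench_sketch).
-export([mean_runtime/3]).

%% RunFun is a fun of one argument (the number of ants) standing in for
%% the single-machine ACO entry point. Returns the mean time in seconds.
mean_runtime(NumAnts, Repeats, RunFun) ->
    Micros = [element(1, timer:tc(fun() -> RunFun(NumAnts) end))
              || _ <- lists:seq(1, Repeats)],
    lists:sum(Micros) / Repeats / 1.0e6.
```

A small-experiment sweep would then map mean_runtime over the ant counts 1, 10, 20, ..., 1000, with Repeats = 5.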



Figure 47: EDF Xeon machines, large executions


Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the OTP 17 versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions


Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions


Figure 53: Heriot-Watt AMD machine, small executions



Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually Test by Failing Servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 36: Runtime Impact of the Number of Schedulers on Single-host Performance, Athos Cluster

Figure 37: Core Usage of the Sim-Diasca City Case Study, GPG Cluster


Figure 38: Memory Usage of the Sim-Diasca City Case Study, GPG Cluster

Figure 39: Network Traffic of the Sim-Diasca City Case Study, GPG Cluster


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster

the network traffic increases as cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we can usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order


to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.
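Varying the number of schedulers, as BenchErl does here, can be sketched with the VM's own system_flag/2. The workload fun below is a stand-in for a Sim-Diasca run, not the actual BenchErl driver code.

```erlang
%% Sketch of a scheduler sweep: for N = 1..Max, cap the number of online
%% schedulers, time the workload, and restore the previous setting.
-module(sched_sweep_sketch).
-export([sweep/2]).

sweep(Workload, MaxSchedulers) ->
    [begin
         Old = erlang:system_flag(schedulers_online, N),
         {Micros, _Result} = timer:tc(Workload),
         erlang:system_flag(schedulers_online, Old),
         {N, Micros}
     end || N <- lists:seq(1, MaxSchedulers)].
```

Note that schedulers_online can only be varied up to the scheduler count fixed when the VM starts, so a full 1-to-64 sweep requires launching the emulator with 64 schedulers in the first place.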

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where the NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was unrelated to the actual simulation, being due for example to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
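Such a plugin might look roughly as follows. The two callback names are hypothetical stand-ins for the Sim-Diasca plugin API, and the Percept2 option list is an assumption; the actual plugin differs in its details.

```erlang
%% Sketch of the Percept2 plugin described above. on_simulation_start/1
%% and on_simulation_stop/1 are hypothetical names for the Sim-Diasca
%% plugin callbacks, and the [procs] option list is an assumption.
-module(percept2_plugin_sketch).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Start Percept2 profiling on every computing node, writing one trace
%% file per node.
on_simulation_start(ComputingNodes) ->
    [rpc:call(Node, percept2, profile,
              [atom_to_list(Node) ++ ".dat", [procs]])
     || Node <- ComputingNodes],
    ok.

%% Stop Percept2 on all computing nodes once the simulation is over.
on_simulation_stop(ComputingNodes) ->
    [rpc:call(Node, percept2, stop_profile, []) || Node <- ComputingNodes],
    ok.
```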

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca under Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 1.6GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licence, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilise WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialised for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involved terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimise the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
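In configuration terms, the intent of these two entries can be illustrated as follows. This is a schematic proplist, not the exact Sim-Diasca configuration syntax; the value shapes and the cookie name are assumptions for illustration.

```erlang
%% Illustrative only: the two options discussed above, written as a
%% proplist. The concrete Sim-Diasca configuration format differs, and
%% the value shapes below are assumptions.
WombatDeploymentOptions = [
    %% Assume the computing nodes passed in as a parameter are already
    %% running (here, started beforehand by WombatOAM), so skip deployment.
    {start_nodes, false},
    %% Use this pre-agreed cookie instead of a randomly generated one;
    %% it must match the cookie used by all computing nodes.
    {use_cookies, 'sim_diasca_cookie'}
].
```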

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. An optimization worth investigating in the future is not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the nodes, but should start them by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have certain names, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
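The naming rule in this example can be sketched as a small Erlang function. This is a hypothetical helper written purely for illustration (it is not part of Sim-Diasca or WombatOAM); it capitalises each underscore-separated word of the simulation name and appends the user and host:

```erlang
%% Hypothetical illustration of the node-naming rule described above:
%% (soda_benchmarking_test, "myuser", "10.0.0.1")
%%   -> 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'
computing_node_name(Simulation, User, Host) ->
    Words = string:split(atom_to_list(Simulation), "_", all),
    CamelCase = lists:join("_", [capitalise(W) || W <- Words]),
    list_to_atom(lists:flatten(
        ["Sim-Diasca_", CamelCase, "-", User, "@", Host])).

%% Upper-case the first character of a word.
capitalise([First | Rest]) -> [string:to_upper(First) | Rest].
```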

To call this function, the user can first attach to WombatOAM's Erlang node, which provides them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described, and shown to be effective for providing scalable reliability, in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
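Under this design, creating the groups could look roughly as follows. This is only a sketch of the idea: we assume the SD Erlang s_group:new_s_group/2 function, while the group names and the grandchildren/1 helper are invented for illustration.

```erlang
%% Sketch: the root's s_group contains the root and its direct children;
%% each child that itself manages further time managers forms an s_group
%% with the nodes below it.
create_time_manager_groups(RootNode, ChildNodes) ->
    {ok, _, _} = s_group:new_s_group(root_tm_group, [RootNode | ChildNodes]),
    lists:foreach(
        fun(Child) ->
            GroupName = list_to_atom("tm_group_" ++ atom_to_list(Child)),
            %% grandchildren/1 is a hypothetical helper returning the
            %% nodes of the time managers below Child:
            s_group:new_s_group(GroupName, [Child | grandchildren(Child)])
        end,
        ChildNodes).
```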


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales: a full profile contains gigabytes of data and requires a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we were able to anticipate the removal of the next bottlenecks to be encountered, and to promote design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally:

mpihelper:get_world_size() returns the total number of nodes, and

mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
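The name construction performed by startup can be pictured as follows. This is an illustrative sketch of the basename ++ MPI index ++ hostname rule described above; the function name and the real mpihelper internals may differ.

```erlang
%% Sketch of the name-building rule described above, e.g.
%% build_node_name("mpinode", 3) might yield 'mpinode3@somehost'.
build_node_name(BaseName, MpiIndex) ->
    {ok, Host} = inet:gethostname(),
    list_to_atom(BaseName ++ integer_to_list(MpiIndex) ++ "@" ++ Host).
```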

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: mean execution time (s) against the number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
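The measurement procedure can be summarised by a small sketch (illustrative only; RunAnts below stands for whatever function runs one ACO execution with the given number of ants, and is not part of the benchmark's actual harness):

```erlang
%% For one Erlang/OTP release: run the benchmark 5 times for each number
%% of ants and record the mean wall-clock time in seconds.
mean_runtime(RunAnts, NumAnts) ->
    Micros = [element(1, timer:tc(RunAnts, [NumAnts]))
              || _ <- lists:seq(1, 5)],
    lists:sum(Micros) / (5 * 1.0e6).

%% The "small" experiment sweeps 1, 10, 20, ..., 1000 ants.
small_experiment(RunAnts) ->
    [{N, mean_runtime(RunAnts, N)} || N <- [1 | lists:seq(10, 1000, 10)]].
```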


[Plot: mean execution time (s) against the number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 47: EDF Xeon machines, large executions

[Plot: mean execution time (s) against the number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: mean execution time (s) against the number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Plot: mean execution time (s) against the number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 50: Glasgow Xeon machines, small executions

[Plot: mean execution time (s) against the number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: mean execution time (s) against the number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: mean execution time (s) against the number of ants (1, 500, 1000, 1500, …, 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(3):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

Page 45: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

Figure 38: Memory Usage of the Sim-Diasca City Case Study (GPG Cluster)

Figure 39: Network Traffic of the Sim-Diasca City Case Study (GPG Cluster)

Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores (Athos Cluster)

the network traffic increases as cluster size grows.

To complement measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger, 60-node-plus scale before we can usefully apply techniques such as SD Erlang s groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order

to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

So what we did was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca was executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores of the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, at which point NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes that are involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
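A minimal sketch of such a plugin follows. The callback names (on_simulation_start/1, on_simulation_stop/1) and the argument forms of the Percept2 profile/stop_profile entry points are assumptions for illustration; the actual hooks are defined by Sim-Diasca's plugin API.

```erlang
-module(percept2_plugin).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% Hypothetical plugin callback: when the simulation starts, start
%% Percept2 profiling on every computing node (one trace file per node).
on_simulation_start(ComputingNodes) ->
    lists:foreach(
        fun(Node) ->
            File = "percept2_" ++ atom_to_list(Node) ++ ".dat",
            rpc:call(Node, percept2, profile, [File, [concurrency, message]])
        end,
        ComputingNodes).

%% Hypothetical plugin callback: when the simulation ends, stop Percept2
%% everywhere, leaving one trace file per node for later analysis.
on_simulation_stop(ComputingNodes) ->
    lists:foreach(
        fun(Node) -> rpc:call(Node, percept2, stop_profile, []) end,
        ComputingNodes).
```

Using rpc:call/4 keeps the plugin itself on the user node while the profiling runs on each computing node.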

Once we added our plugin, we ran the City-example simulation case with 'tiny' scale and 'brief' duration, using only one computing node. We selected this particular setup because we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;

(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'

(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'

• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyze (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
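The timed profiling window can be sketched as a variant of the plugin's start hook (again with hypothetical callback and option names; only the 10-second delay and 5-second window are from the text):

```erlang
%% Hypothetical start hook: 10 seconds after the simulation starts,
%% profile all computing nodes for a 5-second window only, which bounds
%% the size of the resulting trace files.
on_simulation_start(ComputingNodes) ->
    spawn(fun() ->
        timer:sleep(10000),  % let the simulation get past its setup phase
        [rpc:call(N, percept2, profile,
                  ["percept2_" ++ atom_to_list(N) ++ ".dat", [concurrency]])
         || N <- ComputingNodes],
        timer:sleep(5000),   % profile a 5-second window
        [rpc:call(N, percept2, stop_profile, []) || N <- ComputingNodes]
    end),
    ok.
```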

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole simulation, either because of the processing they need or because of the exchanges and communication patterns they rely upon.

Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.

5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem: the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general license, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a

simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
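For illustration, the two entries might look as follows in an Erlang-terms configuration file. This is a sketch only: the concrete syntax, placement, and value types of these entries are defined by Sim-Diasca's configuration mechanism, and the cookie name here is invented.

```erlang
%% Sketch: configuration entries enabling the WombatOAM-driven workflow.
{start_nodes, false}.         % computing nodes already running; skip deployment
{use_cookies, wombat_cookie}. % fixed cookie shared by all computing nodes
```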

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that

already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
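The naming scheme can be illustrated with a small pure function. This is an illustrative reconstruction only; the module and function names are invented, and the actual computation lives inside wo_orch_simdiasca and Sim-Diasca.

```erlang
-module(sd_node_naming).
-export([computing_node_name/3]).

%% Illustrative reconstruction of the node naming scheme:
%% computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1")
%% yields 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.
computing_node_name(Simulation, User, Host) ->
    Capitalised = [capitalise(W) || W <- string:tokens(Simulation, "_")],
    list_to_atom("Sim-Diasca_" ++ string:join(Capitalised, "_")
                 ++ "-" ++ User ++ "@" ++ Host).

%% Upper-case the first character of a word.
capitalise([First | Rest]) -> [string:to_upper(First) | Rest].
```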

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,            % Node family of the computing nodes
>     soda_benchmarking_test).   % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodesare started with those names

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.

Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s groups. This s group connection topology, and the associated message routing between s groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s groups, i.e. the s group of its parent and siblings, and also an s group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s group would provide a gateway with processes that route messages to other s groups.
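A sketch of how such a hierarchy could be built, assuming the SD Erlang s_group:new_s_group/2 primitive. The tree representation and group names here are illustrative, not part of the actual design.

```erlang
-module(tm_s_groups).
-export([create/1]).

%% Sketch: one s_group per internal time manager, containing that
%% manager's node and the nodes of its child time managers. With the
%% default tree of height one this creates a single root s_group.
%% Tree = {Node, [ChildTree]}, where each ChildTree has the same shape.
create({_Node, []}) ->
    ok;  % leaf time manager: no s_group for its (absent) children
create({ParentNode, ChildTrees}) ->
    ChildNodes = [N || {N, _} <- ChildTrees],
    s_group:new_s_group(group_name(ParentNode), [ParentNode | ChildNodes]),
    lists:foreach(fun create/1, ChildTrees).

%% Derive an illustrative s_group name from the parent's node name.
group_name(Node) ->
    list_to_atom("tm_group_" ++ atom_to_list(Node)).
```

Each internal time manager node thus ends up in two s groups (its parent's and its own), matching the connectivity described above.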

Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s groups

6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we prepared the ground for removing the next bottlenecks to be encountered, and promoted design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.

A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users. Only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK, multiple compute nodes share the same IP address, that of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from

Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi instead, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a basename (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
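For concreteness, the node-name scheme described above can be sketched as follows. This is an illustrative helper only: the description fixes the order (base name, MPI index, hostname) but not the separators, which are assumptions here, and in the real system the name is built on the Erlang side by mpi_helper:startup.

```c
#include <stdio.h>

/* Hypothetical sketch of the node-name construction: basename ++ MPI index
 * ++ hostname, rendered here in the usual Erlang node-name shape
 * "<base><index>@<host>". The separator choices are assumptions. */
static int build_node_name(char *out, size_t n,
                           const char *base, int mpi_index, const char *host) {
    return snprintf(out, n, "%s%d@%s", base, mpi_index, host);
}
```

With base name mpinode and MPI index 3 on host athos42, this would yield a node name such as mpinode3@athos42, which can then be handed to net_kernel for initialization.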

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is the possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
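As an illustration of the command-byte dispatch that output performs, the following C sketch parses the first byte of a buffer and routes the remainder to the corresponding handler. The byte values and the stringified handler results are assumptions made for the example, not the actual driver's encoding.

```c
#include <stddef.h>

/* Sketch of the output() entry point's dispatch: the first byte of each
 * buffer handed to the port selects the operation, and the remainder of
 * the buffer is the payload. Command byte values here are illustrative. */
enum { CMD_LISTEN = 1, CMD_ACCEPT, CMD_CONNECT, CMD_SEND, CMD_RECEIVE };

static const char *dispatch(const unsigned char *buf, size_t len) {
    if (len == 0) return "error";
    switch (buf[0]) {
    case CMD_LISTEN:  return "listen";   /* first call also broadcasts node names */
    case CMD_ACCEPT:  return "accept";   /* spawn a thread for the next connection */
    case CMD_CONNECT: return "connect";  /* spawn a thread to contact an acceptor */
    case CMD_SEND:    return "send";     /* payload is buf[1] .. buf[len-1] */
    case CMD_RECEIVE: return "receive";
    default:          return "unknown";
    }
}
```

In the real driver each branch would call into the MPI-backed implementation rather than return a name; the sketch only shows the routing structure.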

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions (execution time in seconds against the number of ants, 1, 10, 20, 30, ..., 1000; one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE)).

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions (execution time in seconds against the number of ants, 1, 500, 1000, 1500, ..., 100000; one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE)).

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set (execution time in seconds against the number of ants, 1, 10, 20, 30, ..., 1000).


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set (execution time in seconds against the number of ants, 1, 500, 1000, 1500, ..., 100000).

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran the experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


Figure 50: Glasgow Xeon machines, small executions (execution time in seconds against the number of ants, 1, 10, 20, 30, ..., 1000).

Figure 51: Glasgow Xeon machines, large executions (execution time in seconds against the number of ants, 1, 500, 1000, 1500, ..., 100000).


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions (execution time in seconds against the number of ants, 1, 10, 20, 30, ..., 1000).


Figure 54: Heriot-Watt AMD machine, large executions (execution time in seconds against the number of ants, 1, 500, 1000, 1500, ..., 100000).

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


Figure 40: Runtimes of the Sim-Diasca City Case Study exploiting Virtual or Actual Cores, Athos Cluster.

the network traffic increases as cluster size grows. To complement measurements made on the Glasgow GPG cluster, similar studies were performed on the EDF Athos cluster, as shown in Figure 40. Despite the latency induced by the network, using only physical cores is consistently more effective than using all (virtual) cores, while increasing the number of hosts beyond a dozen has little effect at that scale.

4.1.3 Discussion

The runtime and speedup graphs show that the performance of the Sim-Diasca City instance improves with scale up to around 16 hosts on both the GPG and Athos clusters (e.g. Figures 34 and 40). However, Sim-Diasca does not efficiently utilise the distributed hardware. While ideal speedups are not expected for such complex simulations, we conclude that there are distributed performance issues to investigate.

We use conventional tools to investigate the issues that arise when scaling the Sim-Diasca City instance to multiple hosts. Specifically, we measure how scaling impacts network, core, and memory usage. The memory consumption results are perhaps the most interesting, indicating that the City instance consumes around 15% of the memory on a Beowulf node. Hence it would not be possible to run a much larger instance on a single node, and execution across multiple hosts would be required, as predicted in Figure 5 in Section 2.2. Sections 4.2 and 4.3 use RELEASE tools to investigate further.

The Sim-Diasca City scalability issues arise at relatively small scales: we start seeing reduced speedups around 8 nodes. These are not the network connectivity issues that emerge at scales of around 60 nodes in the Orbit and ACO measurements in Section 3 and the Riak study reported in [GCTM13], and that are established in Erlang folklore. Crucially, the Sim-Diasca City instance would need to reach this larger 60-node-plus scale before we could usefully apply techniques such as SD Erlang s_groups to improve scalability. We do, however, present a preliminary design for applying SD Erlang in Section 5.2.

4.2 BenchErl

BenchErl [APR+12] is a scalability benchmark suite for applications written in Erlang, developed by RELEASE to measure the scalability of Erlang applications. BenchErl was used with Sim-Diasca in order to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• that the launching of all user and computing nodes was part of the application code; and

• that most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

What we did, therefore, was move all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevent Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca is executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.
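A scheduler sweep of this kind can be scripted along the following lines. This is a dry run that only prints the command lines it would execute; the module and arguments after -s are placeholders, not BenchErl's actual command-line interface.

```shell
# Dry-run sketch of the scheduler sweep: one run per scheduler count,
# doubling from 1 to 64, all with the same +sbt tnnps binding policy.
# The module and arguments after -s are placeholders for the real launcher.
for s in 1 2 4 8 16 32 64; do
  echo erl +sbt tnnps +S "$s" -noshell -s bencherl run city_example tiny brief
done
```

Removing the echo would execute the runs; in practice BenchErl itself drives this variation across scheduler counts, hosts, and OTP releases.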

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), then improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores on the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information unrelated to the actual simulation, such as its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.

Once we had added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca under Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;

ICT-287510 (RELEASE) 23rd December 2015 46

(a) Execution time

(b) Speedup

Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'.


(a) Execution time

(b) Speedup

Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'.


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences: i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it could monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes to Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy the computing nodes first and the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes): Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency: Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing: Our load testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
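To illustrate how the two options fit together, the fragment below is a minimal sketch of a WombatOAM-friendly configuration. The names start_nodes and use_cookies are the options described above; the value shapes, the cookie name, and the surrounding list structure are hypothetical, not Sim-Diasca's actual configuration syntax:

```erlang
%% Hypothetical sketch only: start_nodes and use_cookies come from the
%% text; the entry shapes and surrounding structure are invented.
WombatFriendlyOptions =
    [ %% Assume the computing nodes (passed to the deployment manager
      %% elsewhere) are already running, so skip deploying them.
      start_nodes,
      %% Connect with the cookie already shared by all computing nodes,
      %% instead of generating a random one on the user node.
      {use_cookies, shared_cookie_used_by_computing_nodes} ].
```

In this setup the cookie value must match the one WombatOAM configured on every computing node, as explained above.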

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let the instances acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of naming a node: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the compute nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,            % Node family of the user node
>     soda_benchmarking_test,    % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
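As a minimal sketch of this unimplemented design, creating such a hierarchy with SD Erlang's s_group module might look as follows; the group names and the shape of the node-list argument are hypothetical:

```erlang
%% Hypothetical sketch of the proposed (unimplemented) design: each
%% non-root time manager node joins the s_group of its parent and
%% siblings, and heads an s_group containing its own children.
%% Children is a list of {ChildNode, GrandChildNodes} pairs.
create_tm_s_groups(RootNode, Children) ->
    ChildNodes = [C || {C, _} <- Children],
    %% s_group of the root time manager and its direct children
    _ = s_group:new_s_group(tm_root, [RootNode | ChildNodes]),
    %% one s_group per child time manager and its own children
    [ s_group:new_s_group(
          list_to_atom("tm_" ++ atom_to_list(Child)),
          [Child | GrandChildren])
      || {Child, GrandChildren} <- Children ],
    ok.
```

Gateway processes for routing between these s_groups would then be registered within each group, along the lines of the Multi-level ACO design.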


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

Improving our knowledge of these applications and the scalability issues they experience has prepared the ground for removing the next bottlenecks to be encountered, and has promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address of their associated I/O node; thus, the port can not be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi instead, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. Afterwards, calling net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which is not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the number of nodes in total, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
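Under these assumptions, a node started as part of an MPI job might use the module roughly as follows. This is a sketch based on the function list above; the wrapper function name and the printed output are illustrative:

```erlang
%% Sketch of intended mpihelper usage on each Erlang node of an MPI job.
%% The mpihelper module and its functions come from the text; the
%% wrapper below is invented for illustration.
start_and_inspect() ->
    _ = mpihelper:startup(),             % node name built from the
                                         % default base name, mpinode
    Peers = mpihelper:nodes(),           % all other Erlang nodes
    Size  = mpihelper:get_world_size(),  % total number of nodes
    Index = mpihelper:get_index(),       % this node's MPI index
    io:format("node ~p of ~p, peers: ~p~n", [Index, Size, Peers]).
```

Since startup/0 is called on every node with the same base name, each node ends up with a name that embeds its MPI index and hostname, as described above.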

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded with two processing units per core

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000 ants

• Large: 1, 500, 1000, 1500, …, 100000 ants

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
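A measurement of this kind can be sketched in a few lines of Erlang. This is a hypothetical harness, not the actual scripts used for this report; run_aco/1 stands in for whatever function executes one ACO run:

```erlang
%% Hypothetical sketch: time one configuration Repeats times and
%% return the mean wall-clock time in seconds.
%% timer:tc/1 returns {MicroSeconds, Result}.
mean_runtime(Fun, Repeats) ->
    Micros = [element(1, timer:tc(Fun)) || _ <- lists:seq(1, Repeats)],
    lists:sum(Micros) / Repeats / 1.0e6.

%% e.g. mean_runtime(fun () -> run_aco(NumAnts) end, 5)
```

Averaging over 5 runs, as here, smooths out scheduler and garbage-collection noise between individual executions.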


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for the same five Erlang/OTP versions.]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for the same five Erlang/OTP versions.]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version, in comparison with OTP 17.4 (on which it is based), on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for the same five Erlang/OTP versions.]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1 31/01/2015 First version; submitted to Internal Reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers; submitted to the Commission Services

1.0 27/03/2015 Final version; submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



to collect information about how the latter scales with more VM schedulers, more hosts, different OTP releases, or different command-line arguments.

In order to run Sim-Diasca from BenchErl, we first needed to overcome the following two problems:

• the launching of all user and computing nodes was part of the application code; and

• most of the parameters we wanted BenchErl to be able to control and vary were set by the engine, i.e. either specified in its (Erlang) code or defined in the shell scripts used to run it.

We therefore moved all the code that had to do with node launching from Sim-Diasca to BenchErl, and then prevented Sim-Diasca from either launching any nodes itself or shutting down any nodes that it finds as soon as it starts running.

Finally, we needed to change the way Sim-Diasca is executed through BenchErl, which was essentially the result of the changes described in Section 2.2.3.

The results of running Sim-Diasca with BenchErl on a machine with four Intel(R) Xeon(R) E5-4650 CPUs (2.70GHz), eight cores each (i.e. a total of 32 cores, each with hyperthreading, thus allowing the Erlang/OTP system to have up to 64 schedulers active at the same time), starting the system with the command-line option +sbt tnnps and varying the number of runtime system schedulers from 1 to 64, are shown in Figures 41 and 42.

A quick look at these figures reveals that Sim-Diasca is not very scalable. The time required for performing the simulation in parallel drops quite significantly when using up to four schedulers (OS threads), improves only marginally on eight schedulers (which in this case are mapped to OS threads running on cores of the same physical chip), and then deteriorates when using either the hyperthreads (logical cores of the chip) or when running on different chips, where NUMA effects start becoming visible.

4.3 Percept2

Percept2 is a concurrency profiling tool for Erlang applications, developed by RELEASE and described in Deliverable D5.2. Percept2 was used with Sim-Diasca in order to collect information about the internals of the latter (e.g. how many processes Sim-Diasca spawns, how much of its lifetime each one of these processes spends waiting, etc.).

In order to profile Sim-Diasca in a distributed setting, we needed to run Percept2 on each one of the computing nodes. We had to make sure that Percept2 was started neither too early (so it would not capture information that was not related to the actual simulation but rather, for example, to its setup) nor too late (so it would not miss any information from the simulation execution).

In order to achieve this, we made use of the plugin mechanism provided by Sim-Diasca: we wrote a plugin that, as soon as the simulation starts, goes to each one of the computing nodes and starts Percept2, and, as soon as the simulation ends, stops Percept2 on all computing nodes involved in the simulation execution. In this way we end up with one file for each computing node, which we can later analyse and visualise using Percept2.
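The idea above can be sketched as a small plugin module. The module name, the callback signatures, and the way the node list is obtained are hypothetical, and we assume Percept2 exposes a percept2:profile/2 and percept2:stop_profile/0 API; the actual Sim-Diasca plugin interface and Percept2 options differ in detail.

```erlang
%% Hypothetical sketch: start Percept2 profiling on every computing node
%% when the simulation starts, and stop it when the simulation ends.
-module(percept2_plugin_sketch).
-export([on_simulation_start/1, on_simulation_stop/1]).

%% ComputingNodes is assumed to be the list of computing node names
%% that the engine passes to its plugins.
on_simulation_start(ComputingNodes) ->
    lists:foreach(
      fun(Node) ->
              %% One trace file per node, to be analysed later with Percept2.
              File = "percept2_" ++ atom_to_list(Node) ++ ".dat",
              rpc:call(Node, percept2, profile, [File, [proc, message]])
      end, ComputingNodes).

on_simulation_stop(ComputingNodes) ->
    lists:foreach(
      fun(Node) -> rpc:call(Node, percept2, stop_profile, []) end,
      ComputingNodes).
```

Using rpc:call/4 keeps the plugin itself on the user node while the profiling runs on each computing node, matching the one-trace-file-per-node outcome described above.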

Once we added our plugin, we ran the City-example simulation case with tiny scale and brief duration, using only one computing node. We selected this particular setup because we thought that, if there were any problems related to the implementation of Sim-Diasca that Percept2 could detect, they would be detected even in this simple case. Running a larger instance of Sim-Diasca in Percept2 would generate an excessively large amount of profiling information, which would have been nearly impossible to handle and analyse.

We had Percept2 collect the following information during the execution of the simulation case:

• information about process activity and runnability;

• information about messages sent and received;


Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'. (a) Execution time; (b) speedup.


Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'. (a) Execution time; (b) speedup.


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.
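The waiting pattern in question amounts to a per-diasca barrier. The following is an illustrative simplification, not Sim-Diasca's actual protocol: the time manager blocks until every locally scheduled actor has reported completion, and is idle the whole time.

```erlang
%% Simplified sketch of a per-diasca barrier (illustrative only):
%% the time manager waits for a 'done' message from each local actor
%% scheduled at this diasca before it may advance the simulation.
-module(diasca_barrier_sketch).
-export([wait_for_actors/1]).

wait_for_actors([]) ->
    ok;  %% all local actors have reported: the time manager may advance
wait_for_actors(Pending) ->
    receive
        {actor_done, ActorPid} ->
            wait_for_actors(lists:delete(ActorPid, Pending))
    end.
```

In such a scheme the manager's idle time is bounded below by the slowest locally scheduled actor at each diasca, which is consistent with the profile observed above.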

A more complete study could tell whether most of these instances are runnable but not actually run by the Erlang VM with as much parallelism as they could be, or whether, for example, a few complex models lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

Two WombatOAM features implemented since then make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks, which allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licence, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca, add a Sim-Diasca controller module to WombatOAM, and change the order of deploying Sim-Diasca nodes (we deploy the computing nodes first and the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API, while WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. It is wasteful if the virtual machine instances are provisioned before each simulation and terminated after it. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used, e.g., to provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries that make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
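As an illustration, such configuration entries could look like the following Erlang terms. The entry names follow the options just described, but the node names, cookie value, and file format are invented for this sketch and are not Sim-Diasca's actual configuration syntax.

```erlang
%% Hypothetical configuration sketch: the user node skips deployment of
%% the listed computing nodes (assumed already running) and reuses the
%% cookie with which those nodes were started.
{start_nodes, ['computing_node_1@10.0.0.1',
               'computing_node_2@10.0.0.2']}.
{use_cookies, 'simdiasca_shared_cookie'}.
```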

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all and just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming scheme, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the host name or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a configuration file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
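The grouping could be set up along the following lines. This is a sketch under stated assumptions: the module and helper names are invented, and we assume SD Erlang's s_group:new_s_group/2 creates a named s_group from a list of nodes; the real API and the gateway processes are not shown.

```erlang
%% Hypothetical sketch: walk the time manager tree and create one
%% s_group per parent, containing that manager and its direct children.
%% Each non-root manager then belongs to two s_groups: its parent's
%% group (with its siblings) and the group of its own children.
-module(tm_sgroups_sketch).
-export([create_hierarchy/2]).

%% A tree is {ManagerNode, [ChildTree]}.
create_hierarchy({_Node, []}, _GroupName) ->
    ok;  %% leaf manager: no child group needed
create_hierarchy({Node, Children}, GroupName) ->
    ChildNodes = [N || {N, _} <- Children],
    _ = s_group:new_s_group(GroupName, [Node | ChildNodes]),
    lists:foreach(
      fun({ChildNode, _} = Child) ->
              create_hierarchy(Child, child_group(GroupName, ChildNode))
      end, Children),
    ok.

child_group(Parent, Node) ->
    list_to_atom(atom_to_list(Parent) ++ "_" ++ atom_to_list(Node)).
```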


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and of the scalability issues they experience, we have prepared the ground for removing the next bottlenecks to be encountered, and have promoted some design patterns and good practices to follow regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialised either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialised nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of the Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi instead, one needs to start the Erlang node

with the command erl -no epmd -connect all false -proto dist mpi This will disableepmd only connect to nodes that we explicitly send messages to and activate the mpi dist Erlangmodule as the network driver After starting net kernelstart([mynodename shortnames])will bring up the networking layer as expected

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
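The node-name construction performed by startup can be sketched as follows (a minimal, hypothetical sketch: the real mpi_helper obtains the MPI index from the driver rather than taking it as an argument, and its exact formatting may differ):

```erlang
-module(mpi_helper_sketch).
-export([node_name/2]).

%% Build the name basename ++ MPI index ++ hostname described above.
%% Example: node_name("mpinode", 3) on host "cn017" -> 'mpinode3@cn017'
node_name(BaseName, MpiIndex) ->
    {ok, Hostname} = inet:gethostname(),
    list_to_atom(BaseName ++ integer_to_list(MpiIndex) ++ "@" ++ Hostname).
```

net_kernel:start/1 would then be invoked with a name of this shape to bring up the distribution layer.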

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on that first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially in order to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as-yet-inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. (Execution time in seconds against the number of ants: 1, 10, 20, 30, …, 1000; one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), and OTP 17.4 (RELEASE).)

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.
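For reference, the two sequences of ant counts can be generated as follows (a sketch; the benchmark scripts themselves may construct them differently):

```erlang
-module(ant_counts).
-export([small/0, large/0]).

%% Small: 1, then 10, 20, ..., 1000.  Large: 1, then 500, 1000, ..., 100000.
small() -> [1 | lists:seq(10, 1000, 10)].
large() -> [1 | lists:seq(500, 100000, 500)].
```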

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. (Execution time in seconds against the number of ants: 1, 500, 1000, 1500, …, 100000.)

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. (Execution time in seconds against the number of ants: 1, 10, 20, 30, …, 1000.)


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. (Execution time in seconds against the number of ants: 1, 500, 1000, 1500, …, 100000.)

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. (Execution time in seconds against the number of ants: 1, 10, 20, 30, …, 1000.)

Figure 51: Glasgow Xeon machines, large executions. (Execution time in seconds against the number of ants: 1, 500, 1000, 1500, …, 100000.)


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions. (Execution time in seconds against the number of ants: 1, 10, 20, 30, …, 1000.)


Figure 54: Heriot-Watt AMD machine, large executions. (Execution time in seconds against the number of ants: 1, 500, 1000, 1500, …, 100000.)

Change Log

Version  Date        Comments

0.1      31/01/2015  First version, submitted to internal reviewers

0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günter Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611-620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45-62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73-74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762-774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371-379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207-221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287-296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305-320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181-5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346-354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986-996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 41: BenchErl results running the 'tiny' scale of City-simulation with duration 'brief'. ((a) execution time; (b) speedup.)


Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'. ((a) execution time; (b) speedup.)


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), which Percept2 could analyse, but which contained information for only approximately 5 seconds of the simulation execution.
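The plugin's timed profiling window can be sketched as follows (an illustrative sketch only: the rpc fan-out and the percept2:profile/2 and percept2:stop_profile/0 entry points are assumptions about the plugin code, not a verbatim excerpt):

```erlang
%% Sketch: record ~5 seconds of Percept2 data on every computing node,
%% starting 10 seconds into the simulation (hypothetical helper).
profile_window(ComputingNodes) ->
    timer:sleep(10000),                                   % let the simulation start
    [rpc:call(N, percept2, profile, ["sim_diasca.dat", [procs]])
     || N <- ComputingNodes],
    timer:sleep(5000),                                    % record ~5 s of activity
    [rpc:call(N, percept2, stop_profile, []) || N <- ComputingNodes],
    ok.
```

Keeping the window short is what bounds the trace size (here to roughly 85MB), at the cost of observing only a slice of the run.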

After examining the collected information, we made the observation that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as, on a computing node, a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment procedure that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences: i.e., modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes in Sim-Diasca, add a Sim-Diasca controller module to WombatOAM, and change the order of deploying Sim-Diasca nodes (we deploy the computing nodes first, the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM, or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution, Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution, Sim-Diasca does not know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency: Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used, e.g., to provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing: Our load testing tool, called Megaload, uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, so that they do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
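Conceptually, the two new entries could appear in a simulation's deployment settings along these lines (hypothetical syntax: the option names start_nodes and use_cookies come from the text above, but their exact shape in Sim-Diasca's configuration is an assumption):

```erlang
%% Sketch of the WombatOAM-oriented configuration entries (hypothetical shape).
DeploymentSettings = [
    %% The listed computing nodes are already running; do not deploy them.
    {start_nodes, false},
    %% Use this cookie (already shared by all computing nodes) instead of a
    %% randomly generated one.
    {use_cookies, "wombat_cookie"}
].
```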

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case, one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the nodes; instead, they should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which provides an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which a node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.

ICT-287510 (RELEASE) 23rd December 2015 53

Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology and the associated message routing between s_groups is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
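Under this design, each time-manager node would join two s_groups. A minimal sketch using the SD Erlang s_group API might look as follows; the group names and helper shape are our assumptions, and, as stated above, this design was not implemented:

```erlang
%% Sketch only: place a time-manager node in the s_group of its parent and
%% siblings, and create an s_group for its children (SD Erlang API).
%% s_group:new_s_group/2 creates a named s_group over the given nodes.
setup_time_manager_groups(ParentGroup, Manager, Siblings, Children) ->
    %% The s_group shared with the parent and the sibling managers:
    {ok, ParentGroup, _Nodes1} =
        s_group:new_s_group(ParentGroup, [node() | Siblings]),
    %% The s_group containing this manager and its child managers,
    %% which would also host the gateway routing processes:
    ChildGroup = list_to_atom(atom_to_list(Manager) ++ "_children"),
    {ok, ChildGroup, _Nodes2} =
        s_group:new_s_group(ChildGroup, [node() | Children]).
```

This requires the SD Erlang VM; on a stock Erlang/OTP system the s_group module is not available.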


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our understanding of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user access the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges for porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus, the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which is not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
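Assuming the module and function names as reconstructed above, a session on the compute nodes might look like the following sketch (not tested on the Blue Gene/Q; return values are illustrative):

```erlang
%% Sketch: bringing up MPI-based distribution with the mpi_helper module.
%% The node must have been started with:
%%   erl -no_epmd -connect_all false -proto_dist mpi
mpi_helper:startup(),                  % default base name; initialises net_kernel
Total  = mpi_helper:get_world_size(),  % number of Erlang nodes in the MPI job
Index  = mpi_helper:get_index(),       % this node's MPI index (part of its name)
Others = mpi_helper:nodes(),           % all other nodes, now fully connected
io:format("node ~p of ~p, peers: ~p~n", [Index, Total, Others]).
```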

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete. What remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as-yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. [Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

• Machines in EDF's ATHOS cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000 ants

• Large: 1, 500, 1000, 1500, …, 100000 ants

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000).]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. [Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions. [Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot: execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official), OTP 17.4 (RELEASE).]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot: execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000).]

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow, and Hector Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

• Executive Summary
• The main case study
  • Sim-Diasca Overview
  • City Example
    • Overview of the simulation case
    • Description of the simulated elements
    • Additional changes done for benchmarking
• Benchmarks
  • Orbit
    • Running Orbit on Athos
    • Distributed Erlang Orbit
    • SD Erlang Orbit
    • Experimental Evaluation
    • Results on Other Architectures
  • Ant Colony Optimisation (ACO)
    • ACO and SMTWTP
    • Multi-colony approaches
    • Evaluating Scalability
    • Experimental Evaluation
      • Performance comparison of different ACO and Erlang versions on the Athos cluster
        • Basic results
        • Increasing the number of messages
        • Some problematic results
        • Network Traffic
      • Summary
• Measurements
  • Distributed Scalability
    • Performance
    • Distributed Performance Analysis
    • Discussion
  • BenchErl
  • Percept2
• Experiments
  • Deploying Sim-Diasca with WombatOAM
    • The design of the implemented solution
    • Deployment steps
  • SD Erlang Integration
• Implications and Future Work
• Porting Erlang/OTP to the Blue Gene/Q
  • Basing Erlang/OTP's Distribution Mechanism on MPI
  • MPI Driver Internals
  • Current Status of the Blue Gene/Q Port
• Single-machine ACO performance on various architectures and Erlang/OTP releases
  • Experimental parameters
  • Discussion of results
    • EDF Xeon machines
    • Glasgow Xeon machines
    • AMD machines
  • Discussion


Figure 42: BenchErl results running the 'small' scale of City-simulation with duration 'brief'. (a) Execution time; (b) Speedup.


• information about scheduler concurrency;

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules.

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). We therefore modified our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a much smaller file (approximately 85MB) that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.
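The sampling window described above amounts to the following logic. This is only a hedged sketch: the Percept2 entry points percept2:profile/2 and percept2:stop_profile/0 are assumed from the Percept2 API, and the trace file name and option list are illustrative, not the plugin's actual code.

```erlang
%% Hedged sketch of the plugin's sampling window; the file name and the
%% option list are illustrative assumptions, not Sim-Diasca's plugin code.
sample_window() ->
    timer:sleep(10000),                      % let the simulation run for 10s
    percept2:profile("sim_diasca.dat",       % assumed Percept2 entry point
                     [concurrency, message]),
    timer:sleep(5000),                       % record ~5s of activity
    percept2:stop_profile().
```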

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their life-time running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances are runnable but not actually run by the Erlang VM in parallel as much as they could be, or whether, for example, a few complex models lead some instances to slow down the whole simulation, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

Two WombatOAM features implemented since then make it easier to deploy Sim-Diasca. The first such feature is node name base templates: by default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks, which allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their respective licenses, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options to resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency: Usually the user wants to execute several simulations after each other. If the virtual machine instances are provisioned before each simulation and terminated after each one, it is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing: Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which all need to use the same cookie).
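As an illustration only, the two configuration entries might appear in a deployment settings term roughly as follows; the exact option names, value shapes and surrounding structure used by Sim-Diasca are assumptions here, not its actual configuration format.

```erlang
%% Illustrative sketch only; the option names and value shapes are
%% assumptions, not Sim-Diasca's actual configuration entries.
DeploymentOptions = [
    %% Tell the deployment manager that the computing nodes it received
    %% as a parameter are already running, so they need not be deployed:
    {start_nodes, false},
    %% Use this fixed cookie (shared by all computing nodes) instead of
    %% a randomly generated one:
    {use_cookies, wombat_cookie}
].
```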

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, and no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let instances acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, the node should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation needs to be passed to the script that starts a computing node. As an example of the naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 by the user myuser, then the name of the node should be Sim-Diasca Soda Benchmarking Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
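The two-s_group membership could be expressed with SD Erlang's s_group:new_s_group/2 roughly as follows. Since this design was not implemented, the code is only indicative; the group naming scheme is an assumption.

```erlang
%% Indicative sketch of the proposed time-manager grouping; the group
%% naming scheme is an illustrative assumption, not an implemented design.
-module(tm_s_groups).
-export([create_child_group/2]).

%% Called on behalf of a time manager on node Parent, for its child time
%% managers on nodes Children. The parent itself was already placed in its
%% own parent's group by the corresponding call one level up, so after this
%% call it belongs to exactly two s_groups.
create_child_group(Parent, Children) ->
    GroupName = list_to_atom("tm_group_" ++ atom_to_list(Parent)),
    s_group:new_s_group(GroupName, [Parent | Children]).
```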


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver instead, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
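The name construction performed by startup can be sketched as follows. This is a hedged reconstruction from the description above: the "@" separator and the stubbed get_index/0 are assumptions, and the real function queries the MPI environment rather than returning a constant.

```erlang
%% Hedged sketch of mpihelper:startup/1's name construction; the "@"
%% separator and the stubbed get_index/0 are assumptions.
-module(mpihelper_sketch).
-export([startup/1]).

startup(BaseName) ->
    Index = get_index(),
    {ok, Host} = inet:gethostname(),
    Name = list_to_atom(BaseName ++ integer_to_list(Index) ++ "@" ++ Host),
    net_kernel:start([Name, shortnames]).

%% Stub: the real implementation obtains the rank from the MPI driver.
get_index() -> 0.
```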

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

bull The send and receive functions respectively send or receive data

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

bull Tick messages are sent every time the tick-message call is triggered

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. (Execution time (s) vs. number of ants (1, 10, 20, 30, …, 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).)

• Machines in EDF's ATHOS cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
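The measurement protocol can be sketched as a small harness; this is our illustration, not the actual benchmarking script, and RunFun stands in for the ACO entry point.

```erlang
%% Sketch of the measurement protocol: 5 runs per ant count, mean reported.
%% RunFun is a stand-in for the (hypothetical) ACO entry point.
mean_time(RunFun, NumAnts) ->
    Times = [element(1, timer:tc(RunFun, [NumAnts])) / 1.0e6
             || _ <- lists:seq(1, 5)],
    lists:sum(Times) / length(Times).   % mean execution time in seconds
```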


Figure 47: EDF Xeon machines, large executions. (Execution time (s) vs. number of ants (1, 500, 1000, 1500, …, 100000), same Erlang/OTP versions as Figure 46.)

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set.


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set.

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s) against number of ants (up to 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions.

[Plot: execution time (s) against number of ants (up to 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions.


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine, and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.
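One way to impose such a restriction is via the Linux taskset utility, as sketched below. This is an illustration rather than our exact command: the benchmark invocation is a placeholder, and whether the even-numbered logical CPUs correspond to one hyperthread per physical core depends on the machine's CPU topology (it can be checked with lscpu).

```shell
# Pin the whole Erlang VM to the even-numbered logical CPUs, so that on a
# typical Intel topology only one hyperthread per physical core is used.
taskset -c 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 \
    erl -noshell -run aco main 1000 -s init stop
```

This is a command-line fragment only; the same restriction could alternatively be expressed with the Erlang VM's scheduler binding flags.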

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

[Plot: execution time (s) against number of ants (up to 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions.


[Plot: execution time (s) against number of ants (up to 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions.

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



• information about scheduler concurrency

• information about the invocations of functions defined in the class_TimeManager and class_Actor modules

Unfortunately, collecting that particular information during the execution of that particular simulation case resulted in files that were too large for Percept2 to analyse (approximately 16GB). So we decided to modify our Sim-Diasca plugin so that it waits for 10 seconds after the simulation starts, starts Percept2 on all computing nodes, and then waits for 5 more seconds before stopping Percept2. That way we ended up with a file that was much smaller than the previous one (approximately 85MB), that Percept2 could analyse, but that contained information for only approximately 5 seconds of the simulation execution.
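The windowing scheme just described can be sketched as follows. This is only an illustration of the idea, not the actual plugin code: the entry points percept2:profile/2 and percept2:stop_profile/0 and the options list are assumptions about the Percept2 API.

```erlang
%% Sketch: trace only a 5-second window, starting 10 seconds into the
%% simulation, to keep the trace file small enough for Percept2 to analyse.
%% The Percept2 function names and options shown here are assumptions.
windowed_profile(TraceFile) ->
    timer:sleep(10000),                        % let the simulation warm up
    percept2:profile(TraceFile, [procs]),      % start collecting on this node
    timer:sleep(5000),                         % collect ~5s of activity only
    percept2:stop_profile().
```

The same pattern (delayed start, bounded duration) applies to any tracing tool whose output grows linearly with simulation time.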

After examining the collected information, we observed that all 263 processes spawned by Sim-Diasca during those 5 seconds spent a very small part of their lifetime running, as shown in Figure 43.

The underlying synchronisation schemes necessary for the simulation operation surely induce at least part of this idle time, as on a computing node a time manager process must wait for all its local model instances that happen to be scheduled at this logical moment (diasca) to report that their behaviour evaluation is over.

A more complete study could tell whether most of these instances could be runnable but are not actually run by the Erlang VM in parallel as much as they could be, or if, for example, a few complex models could lead to some instances slowing down the whole process, either because of the processing they need or because of the exchanges and communication patterns they rely upon.


Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.

5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licences, i.e. modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed, it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to perform small changes in Sim-Diasca, we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We were considering two possible options that would resolve this problem.

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes). Sim-Diasca would use WombatOAM's API; on the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, and it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, it is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of a simulation execution affecting the next simulation.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour, we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider, this approach does not work; in that case one of two alternative solutions can be used. The first is that when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming, if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
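The naming convention above can be illustrated with a small helper. This is hypothetical code, not part of Sim-Diasca or WombatOAM; it merely reconstructs the name format shown in the example (capitalised simulation name, user name and host joined together).

```erlang
-module(simdiasca_names).
-export([computing_node_name/3]).

%% Derive the node name Sim-Diasca expects for a computing node from the
%% simulation name, the user name and the host, following the example above:
%% "soda_benchmarking_test" + "myuser" + "10.0.0.1" gives
%% "Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1".
computing_node_name(Simulation, User, Host) ->
    Camel = string:join(
              [capitalise(W) || W <- string:tokens(Simulation, "_")], "_"),
    "Sim-Diasca_" ++ Camel ++ "-" ++ User ++ "@" ++ Host.

%% Upper-case the first character of a word.
capitalise([C | Rest]) -> string:uppercase([C]) ++ Rest;
capitalise([]) -> [].
```

The real derivation is performed inside Sim-Diasca's deployment code; this sketch only makes the structure of the name explicit.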

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                          % Node family of the user node
>     soda_benchmarking_test,                  % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).   % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology and the associated message routing between s_groups is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
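How such a group might be created can be sketched with SD Erlang's s_group API. This is an outline only, since the design has not been implemented: the function and group names are illustrative, and we assume the s_group:new_s_group/2 call of the SD Erlang prototype.

```erlang
%% Sketch: a non-leaf time manager node creates an s_group containing
%% itself and the nodes of its child time managers. The same node is also
%% a member of the s_group created by its own parent, giving each time
%% manager membership of exactly two s_groups, as described above.
%% GroupName is an atom identifying this subtree's group (illustrative).
create_child_group(GroupName, ParentNode, ChildNodes) ->
    %% Namespace and transitive connections are then shared only within
    %% the group, reducing global connectivity.
    {ok, GroupName, _Members} =
        s_group:new_s_group(GroupName, [ParentNode | ChildNodes]).
```

Gateway processes registered inside each group would then forward cross-group messages along the tree, analogously to the Multi-level ACO routing of Section 3.2.2.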


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

By improving the knowledge about these applications and the scalability issues they experience, interpretations were made, preparing the removal of the next bottlenecks to be encountered and promoting some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users. Only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be exploited by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution  Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

ICT-287510 (RELEASE) 23rd December 2015 57

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.
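For context, the -proto_dist mpi flag makes OTP look up a module named mpi_dist that exports the distribution callbacks the kernel expects — the same set exported by the default inet_tcp_dist module. The following schematic sketch shows the shape such a module takes; the delegation to an mpi_server module is an illustrative assumption, not the actual RELEASE implementation:

```erlang
%% Schematic shape of a -proto_dist back-end module. The callback set
%% follows OTP's distribution-module convention (cf. inet_tcp_dist);
%% the mpi_server delegation below is an assumption for illustration.
-module(mpi_dist).
-export([listen/1, accept/1, accept_connection/5,
         setup/5, close/1, select/1, is_node_name/1]).

select(_Node) -> true.                 %% any node is reachable over MPI
is_node_name(Node) -> is_atom(Node).

listen(Name) ->                        %% put a port into listening mode
    mpi_server:listen(Name).
accept(Listen) ->                      %% wait for incoming connections
    mpi_server:accept(Listen).
accept_connection(AcceptPid, Socket, MyNode, Allowed, SetupTime) ->
    mpi_server:accept_connection(AcceptPid, Socket, MyNode, Allowed, SetupTime).
setup(Node, Type, MyNode, LongOrShortNames, SetupTime) ->
    mpi_server:setup(Node, Type, MyNode, LongOrShortNames, SetupTime).
close(Socket) ->
    mpi_server:close(Socket).
```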

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module  This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
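A minimal usage sketch of the mpihelper API described above follows; the function names come from the text, while the return values and output format are assumptions:

```erlang
%% Hypothetical usage sketch of the mpihelper module. Function names
%% follow the text; exact return values are assumptions.
-module(mpi_startup_example).
-export([main/0]).

main() ->
    mpihelper:startup(),                 %% default base name: mpinode
    Index = mpihelper:get_index(),       %% unique MPI index of this node
    Total = mpihelper:get_world_size(),  %% total number of Erlang nodes
    Peers = mpihelper:nodes(),           %% all other nodes in the job
    io:format("node ~p of ~p, peers: ~p~n", [Index, Total, Peers]).
```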

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init  This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start  This function is called for every new port opened, and only initializes the port's data structures.

stop  This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output  This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input  Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output  Transmits all data currently buffered in the port.

finish  Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control  Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,¹⁰ and, as discussed, it is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as-yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰ For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. [Plot: execution time (s), 0.0–0.8, against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
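The two ant-count sequences can be written out programmatically; a small Erlang sketch, with the step sizes read off the sequences listed above:

```erlang
%% Ant counts used in the two experiment sizes.
small_counts() -> [1 | lists:seq(10, 1000, 10)].      %% 1, 10, 20, ..., 1000
large_counts() -> [1 | lists:seq(500, 100000, 500)].  %% 1, 500, 1000, ..., 100000
```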

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot: execution time (s), 0–80, against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot: execution time (s), 0.0–0.8, against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot: execution time (s), 0–80, against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. [Plot: execution time (s), 0.0–1.4, against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions. [Plot: execution time (s), 0–120, against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot: execution time (s), 0.0–1.2, against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot: execution time (s), 0–100, against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version, based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günter Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



Figure 43: Running Sim-Diasca in Percept2. The time during which a process is executing is shown in green.


5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default, WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 users can specify a node name base template when deploying nodes. (The node names are then made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their general licence; i.e., modifications to Sim-Diasca are under LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the results.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes in Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy the computing nodes first and the user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one that deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in Sim-Diasca, WombatOAM, or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options that would resolve this problem:

1. The first option was to modify Sim-Diasca so that when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that when it is asked to start a simulation, the necessary computing nodes are already there, so it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. Provisioning the virtual machine instances before each simulation and terminating them after each simulation is wasteful. With the second solution, the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation's execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API. (See D6.6 for more information.) This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries that make it behave in a way usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes it received as a parameter are already running and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
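For illustration only: the report does not show the concrete syntax of these configuration entries, so the proplist format, the flag polarity, and the cookie value below are assumptions; only the option names start_nodes and use_cookies come from the text.

```erlang
%% Hypothetical sketch of a WombatOAM-friendly Sim-Diasca configuration.
%% Option names are from the text; format and values are assumptions.
WombatFriendlyOptions = [
    %% Assume the computing nodes passed in are already running, so the
    %% deployment manager must not deploy them (polarity is an assumption):
    {start_nodes, false},
    %% Use this fixed cookie, shared by all computing nodes, instead of
    %% generating random cookies on the user node:
    {use_cookies, 'sim_diasca_cookie'}
].
```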

5.1.2 Deployment steps

Prerequisites  Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after it is provisioned but before the user node is started on it. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes  The user needs to deploy an Erlang release that contains only Erlang/OTP itself and no concrete Erlang program, because the Sim-Diasca user node will later transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all and simply expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but instead to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
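The naming rule can be sketched as a small helper function; the capitalisation and separator choices below are inferred from the single example given in the text and are assumptions, not Sim-Diasca's documented algorithm:

```erlang
%% Hypothetical sketch of the computing-node naming rule. For example,
%% node_name(soda_benchmarking_test, "myuser", "10.0.0.1") would yield
%% a name of the form Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
node_name(Simulation, User, Host) ->
    Words = string:split(atom_to_list(Simulation), "_", all),
    Camel = lists:join("_", [string:titlecase(W) || W <- Words]),
    list_to_atom(lists:flatten(["Sim-Diasca_", Camel, "-", User, "@", Host])).
```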

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,          % Node family of the computing nodes
>     soda_benchmarking_test). % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which a node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.

ICT-287510 (RELEASE) 23rd December 2015 53

Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM starts the Sim-Diasca user node with the simulation name as a parameter and generates a configuration file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could then be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node automatically starts the execution of the simulation. Logs of running the simulation are placed in $HOME/simdiasca.txt, and the result of the simulation is stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure a uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
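A minimal sketch of such a partitioning, assuming the SD Erlang s_group API (s_group:new_s_group/2) and a hypothetical representation of the time manager tree, could look as follows; this is an illustration of the design, not Sim-Diasca code:

```erlang
%% Hypothetical sketch of the proposed partitioning (the tree representation
%% and group names are illustrative assumptions). One s_group is created per
%% time manager, containing that manager's node and its children's nodes, so
%% that connections and namespaces stay local to each level of the tree.
create_time_manager_groups(Tree) ->
    %% Tree :: [{GroupName :: atom(), Parent :: node(), Children :: [node()]}]
    [s_group:new_s_group(GroupName, [Parent | Children])
     || {GroupName, Parent, Children} <- Tree].
```

A non-root time manager's node would thus appear in exactly two s_groups, matching the design above, and gateway processes registered within each s_group would forward messages across group boundaries.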


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have laid the groundwork for removing the next bottlenecks to be encountered, and have promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialised either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialised nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, a working network layer is necessary. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. To explain this properly, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us first see, however, how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, one instead needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes that we explicitly send messages to, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialised. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialisation. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialise the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialised.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
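Based on this description, the name construction performed by startup can be sketched as follows; this is a hypothetical illustration, not the actual mpihelper source:

```erlang
%% Hypothetical sketch (not the actual mpihelper source): build the node
%% name basename ++ MPI index ++ hostname described above, then hand it
%% to net_kernel for initialisation.
startup_sketch(BaseName, MpiIndex) ->
    {ok, HostName} = inet:gethostname(),
    Name = lists:flatten([BaseName, integer_to_list(MpiIndex), HostName]),
    net_kernel:start([list_to_atom(Name), shortnames]).
```

Embedding the MPI rank in the node name is what allows the driver to reverse-map node names to MPI index numbers later on.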

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialised. It sets up the MPI environment and initialises the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initialises the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initialises a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialised, it broadcasts the names of all Erlang nodes on that first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data available.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to trigger tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. [Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


Figure 50: Glasgow Xeon machines, small executions. [Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which meant that only one processing unit per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot: execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot: execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



5 Experiments

5.1 Deploying Sim-Diasca with WombatOAM

In early 2014 we created a version of Sim-Diasca that can be deployed with WombatOAM [Sol14]. Below we first outline the problem, the main challenge, two alternative solutions to that challenge, and the reasons for picking one of them. Then we show the steps to deploy Sim-Diasca with WombatOAM and execute a simulation.

There are two WombatOAM features implemented since then that make it easier to deploy Sim-Diasca. The first such feature is node name base templates. By default WombatOAM auto-generates the names of the deployed nodes, but since mid-2014 the users can specify a node name base template when deploying nodes. (The node names will then be made up of a node name base, generated from the node name base template, plus the host name.) The second feature is called deployment hooks. Deployment hooks allow the users to specify actions that should be taken before and after a node is started. Both features have been implemented as part of D4.5, where they are described thoroughly. Below we detail the deployment that worked when we originally implemented Sim-Diasca deployment, while pointing out the steps that are now easier thanks to these new features.

The modifications in both tools follow their respective licenses, i.e. modifications to Sim-Diasca are under the LGPL, while modifications to WombatOAM are proprietary.

5.1.1 The design of the implemented solution

The original Sim-Diasca uses its own deployment manager in the following way to deploy its nodes. The user starts the user node and asks it to execute a particular simulation with particular parameters. Then the deployment manager (which runs on the user node) connects to the host machines, starts the computing nodes, and starts the simulation on them. After the simulation is executed, it collects the result.

Without any modifications to Sim-Diasca, WombatOAM could deploy and monitor only the user node. It would not be able to help in the deployment of the computing nodes, since that is performed by the user node. After the nodes are deployed it can monitor them, but not automatically, since it is not notified about the deployment of the computing nodes.

In order to fully utilize WombatOAM's capabilities, we needed to make small changes to Sim-Diasca: we added a Sim-Diasca controller module to WombatOAM, and we changed the order of deploying Sim-Diasca nodes (we deploy computing nodes first, user node second). The reasons for these changes can be understood by looking at our main challenge and how we solved it.

The main challenge was that, when the user starts a user node and asks it to execute a simulation, the user node's deployment manager is the one who deploys and starts the computing nodes. It assumes that the correct versions of Erlang and Sim-Diasca are available on the host machine. We would like WombatOAM to perform this deployment step, since WombatOAM Orchestration is specialized for this and can provide more than Sim-Diasca's deployment manager: it can provision virtual machines, copy Erlang and Sim-Diasca to those machines, and automatically add the new nodes to WombatOAM Monitoring. It was clear that using WombatOAM for deployment and monitoring required some changes in either Sim-Diasca or WombatOAM or both, since at that time Sim-Diasca simply performed the deployment on its own. We considered two possible options for resolving this problem.

1. The first option was to modify Sim-Diasca so that, when a simulation is executed, it asks Wombat to deploy the necessary computing nodes and then proceeds with executing the simulation itself. In this solution Sim-Diasca would be in control and would give instructions to WombatOAM (namely, about deploying nodes); Sim-Diasca would use WombatOAM's API. On the other hand, WombatOAM would not need to know anything special about Sim-Diasca. After executing a simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes would involve terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM first to deploy the number of computing nodes that he wants to use for a simulation, and then to deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one, for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after the other. Provisioning the virtual machine instances before each simulation and terminating them after each simulation is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations, in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard, which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries that make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, so that they do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect to the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).
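For illustration, a simulation run driven by WombatOAM might set these two options roughly as follows. This is a sketch only: the exact key names and value shapes belong to Sim-Diasca's deployment settings and should be checked against its source, and the cookie value here is invented.

```erlang
%% Hypothetical sketch of the two configuration entries described above.
%% start_nodes: the computing nodes passed in are assumed to be running
%% already, so the deployment manager must not (re)deploy them.
%% use_cookies: use this fixed cookie (shared by all computing nodes)
%% instead of generating a random one on the user node.
[ {start_nodes, false},
  {use_cookies, 'invented_shared_cookie'} ].
```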

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and instead expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but rather to expect images to already contain it, or to let the instances acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it among each other using a peer-to-peer protocol like BitTorrent. However, this experiment focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of the naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 by the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.
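The naming rule exemplified above can be sketched as a small Erlang function. The helper below is illustrative only, not Sim-Diasca's actual code; it assumes the rule "capitalise each underscore-separated word of the simulation name, prefix Sim-Diasca_, and append -user@host":

```erlang
%% Illustrative sketch of the node-naming rule (not Sim-Diasca code).
%% computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1")
%% would yield 'Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1'.
computing_node_name(Simulation, User, Host) ->
    Words  = string:tokens(Simulation, "_"),
    Capped = [[string:to_upper(C) | Rest] || [C | Rest] <- Words],
    Base   = string:join(Capped, "_"),
    list_to_atom("Sim-Diasca_" ++ Base ++ "-" ++ User ++ "@" ++ Host).
```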

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca/WombatOAM integration work detailed here, we have implemented node name base templates, which means that the users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameters, which means that we no longer need to rely on this function to supply the start script with the simulation name.

ICT-287510 (RELEASE) 23rd December 2015 53

Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, the functionality of this function could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the Sim-Diasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups, i.e. the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
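To make the idea concrete, the following sketch shows how such groups could be created with SD Erlang's s_group:new_s_group/2. All node and group names are invented for illustration; the real integration would happen inside Sim-Diasca's deployment code, as outlined above.

```erlang
%% Hypothetical sketch of the hierarchy of Figure 45: the root time
%% manager forms an s_group with its child managers, and each child
%% in turn forms a further s_group with its own children.
s_group:new_s_group(tm_root,   ['root@h0', 'tm1@h1', 'tm2@h2']),
s_group:new_s_group(tm1_group, ['tm1@h1', 'tm1a@h3', 'tm1b@h4']),
s_group:new_s_group(tm2_group, ['tm2@h2', 'tm2a@h5', 'tm2b@h6']).
```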


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and of two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3, for example.

By improving our knowledge of these applications and the scalability issues they experience, we were able to prepare for the removal of the next bottlenecks to be encountered, and to promote design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale infrastructures similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and for certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user access the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds, to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which must also be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes to which we explicitly send messages, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions send and receive data, respectively.

ready_input: Receives available data, or tells the runtime to call this function again once it believes there is data to read.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between the different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP[10] and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

[10] For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Figure: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
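In Erlang terms, the two sequences of ant counts above are simply:

```erlang
%% Ant counts for the two experiment sizes.
SmallCounts = [1 | lists:seq(10, 1000, 10)],      % 1, 10, 20, 30, ..., 1000
LargeCounts = [1 | lists:seq(500, 100000, 500)].  % 1, 500, 1000, ..., 100000
```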

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Figure: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 47: EDF Xeon machines, large executions

[Figure: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Figure: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on the Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results shown in Figure 52. It is a


[Figure: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 50: Glasgow Xeon machines, small executions

[Figure: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE)]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version), and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.
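On Linux, one way to impose such a restriction is via the CPU affinity interface (or the `taskset` utility); the text does not state which mechanism was actually used, so the sketch below is purely illustrative, and assumes the OS numbers hyperthread siblings so that even IDs fall on distinct physical cores:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Build a CPU set containing only the even-numbered CPUs 0, 2, ..., 2*(n-1).
 * Applying it to the Erlang VM's process with
 * sched_setaffinity(pid, sizeof(cpu_set_t), &set) would leave one
 * processing unit per physical core on a hyperthreaded Xeon.
 * Illustrative only: the deliverable does not say how the restriction
 * was implemented. */
cpu_set_t even_cpus(int n)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < n; cpu++)
        CPU_SET(2 * cpu, &set);
    return set;
}
```
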

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Change Log

Version 0.1 (31/01/2015): First version, submitted to internal reviewers.

Version 0.2 (23/03/2015): Revised version based on comments from all internal reviewers, submitted to the Commission Services.

Version 1.0 (27/03/2015): Final version, submitted to the Commission Services.

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9–10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.


simulation, Sim-Diasca would ask Wombat to terminate the computing nodes. If the nodes were created on a cloud provider, then terminating the computing nodes involved terminating the underlying virtual machine instances too.

2. The second option was to modify Sim-Diasca to assume that, when it is asked to start a simulation, the necessary computing nodes are already there, so that it can simply skip the deployment steps and proceed with executing the simulation on the nodes. This would mean that WombatOAM (or one of its plugins) would be in control: the user would ask WombatOAM to first deploy the number of computing nodes that he wants to use for a simulation, and then deploy the user node and ask it to execute the simulation on the already existing computing nodes. In this solution Sim-Diasca doesn't know anything about WombatOAM, but WombatOAM knows how to start a simulation on Sim-Diasca (which includes giving it a list of computing nodes).

Both solutions are viable, but we chose and implemented the second one for the following reasons:

• Efficiency. Usually the user wants to execute several simulations one after another. If the virtual machine instances are provisioned before each simulation and terminated after each simulation, this is wasteful. With the second solution the user has control over when the nodes are provisioned and terminated. (Note that this still allows automation: WombatOAM's API can be used to, e.g., provision a set of instances for computing nodes, execute a certain number of simulations in succession on them, and then terminate all instances automatically. It is recommended, however, to restart the nodes between simulations in order to minimize the risk of one simulation execution affecting the next.)

• Interfacing. Our load testing tool called Megaload uses WombatOAM for deployment, and it has its own web dashboard which uses WombatOAM's API (see D6.6 for more information). This solution opens up the possibility of building a similar web dashboard for controlling the execution of Sim-Diasca simulations.

• This solution meant changing Sim-Diasca in a way that makes it easier to use other deployment methods as well, not only WombatOAM.

We wanted to keep the original deployment method of Sim-Diasca too, so instead of changing Sim-Diasca's default behaviour we only introduced configuration entries which make it behave in a way that is usable from WombatOAM. Namely, the start_nodes option was introduced, which asks the deployment manager to assume that the computing nodes that it received as a parameter are already running, and thus do not need to be deployed. The other configuration option we introduced is the use_cookies option, which inhibits the default Sim-Diasca behaviour of generating random Erlang cookies on the user node and using those to connect with the computing nodes, and instead makes the user node use the given Erlang cookie. Since the computing nodes are deployed before the user node when Sim-Diasca is used with WombatOAM, the value of this option needs to be the cookie used by all computing nodes (which need to use the same cookie).

5.1.2 Deployment steps

Prerequisites. Sim-Diasca needs certain software packages to be installed on the host that executes the user node. These packages are listed in the Sim-Diasca manual.

Meeting these dependencies is easy when deploying on an existing machine: those packages can be installed before using Sim-Diasca. When deploying using a cloud provider this approach does not work; in that case one of two alternative solutions can be used. The first is that, when defining the deployment domain in WombatOAM for the user node, the user specifies a virtual machine image that


already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, and instead expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of future investigation not to copy the Erlang release to all instances, but to expect the images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focused on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that is why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead he should start the node by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and the name of the simulation should be passed to the script that starts a computing node. As an example of the naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
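The naming rule just described can be sketched as a small helper. Note that this helper is hypothetical, and the exact separator characters are an assumption on our part (the deliverable's PDF rendering loses them):

```c
#include <stdio.h>

/* Hypothetical helper illustrating the computing-node naming rule
 * described above: the capitalised simulation title sits between a
 * "Sim-Diasca" prefix and "-user@host".  The separator choices are
 * our assumption, not Sim-Diasca's documented format. */
void computing_node_name(char *out, size_t len,
                         const char *simulation_title, /* already capitalised */
                         const char *user, const char *host)
{
    snprintf(out, len, "Sim-Diasca_%s-%s@%s", simulation_title, user, host);
}
```
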

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide him with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                        % Node family of the user node
>     soda_benchmarking_test,                % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generates a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, this functionality could be moved into a function used as a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining computing nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s groups. This s group connection topology, and the associated message routing between s groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s groups, i.e. the s group of its parent and siblings, and also an s group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s group would provide a gateway with processes that route messages to other s groups.


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we were able to interpret their behaviour, prepare for the removal of the next bottlenecks to be encountered, and promote some design patterns and good practices regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
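The flavour of these workarounds can be sketched as a retrying wrapper around read(). This is an illustration of the technique only, not the actual patch from the Erlang/OTP port:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keep retrying a read() that would block, instead of sleeping in the
 * kernel, since blocking inside CNK could deadlock.  Illustrative
 * sketch; the real port wraps both read() and write() in this way. */
ssize_t spin_read(int fd, void *buf, size_t count)
{
    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0)
            return n;               /* success, or 0 at end-of-file */
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;              /* a real error: report it */
        /* would-block or interrupted: spin and try again */
    }
}
```
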

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.
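The name construction can be sketched as follows; the "@" separator before the hostname is our assumption, based on standard Erlang node-name format, and the example hostname is hypothetical:

```c
#include <stdio.h>

/* Illustrative sketch of mpihelper's node-name construction:
 * base name ++ MPI rank ++ "@" ++ hostname, e.g. "mpinode3@athos077".
 * The "@" is assumed from Erlang's usual node-naming convention. */
void mpi_node_name(char *out, size_t len,
                   const char *base, int mpi_index, const char *hostname)
{
    snprintf(out, len, "%s%d@%s", base, mpi_index, hostname);
}
```
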

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed, and is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.
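The command-byte parsing in output can be pictured as a simple dispatch. The command numbering below is invented for illustration; the deliverable does not give the actual byte values used by the driver:

```c
/* Hypothetical command bytes for the driver's "output" entry point;
 * the real values used by the mpi driver are not given in the text. */
enum mpi_cmd { CMD_LISTEN, CMD_ACCEPT, CMD_CONNECT, CMD_SEND, CMD_RECEIVE };

/* Parse the leading command byte and return the name of the operation
 * the remainder of the buffer would be relayed to, standing in for the
 * driver's real dispatch logic. */
const char *dispatch_command(const unsigned char *buf)
{
    switch (buf[0]) {
    case CMD_LISTEN:  return "listen";
    case CMD_ACCEPT:  return "accept";
    case CMD_CONNECT: return "connect";
    case CMD_SEND:    return "send";
    case CMD_RECEIVE: return "receive";
    default:          return "unknown";
    }
}
```
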

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

• Machines in EDF's ATHOS cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
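The two sweeps and the five-run averaging can be sketched as follows; this is a Python sketch, and `run_aco` is a hypothetical stand-in for launching the actual benchmark.

```python
from statistics import mean

# Ant counts for the two experiment sizes described above.
small_counts = [1] + list(range(10, 1001, 10))      # 1, 10, 20, ..., 1000
large_counts = [1] + list(range(500, 100001, 500))  # 1, 500, 1000, ..., 100000

def mean_execution_time(run_aco, num_ants, repetitions=5):
    """Run the benchmark `repetitions` times and average the timings.

    `run_aco(num_ants)` is a hypothetical stand-in that launches the ACO
    program (input size 40, 50 generations) and returns the execution time.
    """
    return mean(run_aco(num_ants) for _ in range(repetitions))
```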


[Plot: execution time (s), 0–80, against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s), 0.0–0.8, against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s), 0–80, against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.
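The "roughly constant ratio" observation amounts to computing a point-by-point slowdown over runs with the same number of ants and averaging it. A sketch, with made-up timings chosen only to illustrate the calculation (the measured figure on these machines was about 12.6%):

```python
from statistics import mean

def mean_slowdown_percent(t_new, t_old):
    """Mean percentage slowdown of `t_new` relative to `t_old`, computed
    point-by-point over runs with the same number of ants."""
    ratios = [n / o for n, o in zip(t_new, t_old)]
    return (mean(ratios) - 1.0) * 100.0

# Illustrative (made-up) timings, not measured data.
r15b   = [0.10, 0.20, 0.40]
otp174 = [0.113, 0.225, 0.4492]
print(round(mean_slowdown_percent(otp174, r15b), 1))  # prints 12.6
```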

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Plot: execution time (s), 0.0–1.4, against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s), 0–120, against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s), 0.0–1.2, against number of ants (1, 10, 20, 30, …, 1000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s), 0–100, against number of ants (1, 500, 1000, 1500, …, 100000); one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE)]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1 31.01.2015 First version, submitted to internal reviewers

0.2 23.03.2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0 27.03.2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.


[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually Test by Failing Servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.


[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.



already has these software packages installed. The other solution is to use deployment hooks to ask WombatOAM to install these packages on each virtual machine instance after they are provisioned, but before the user node is started on them. Deployment hooks have been implemented as part of D4.5, where they are described in detail.

Deploying Sim-Diasca computing nodes. The user needs to deploy an Erlang release that contains only Erlang/OTP itself, no concrete Erlang program, because later the Sim-Diasca user node will transfer the simulation modules that will be used during the execution of the simulation.

We considered not using a release at all, just expecting the correct version of Erlang/OTP to be already present on the machine, but transferring Erlang/OTP was more in line with WombatOAM's behaviour in other deployments. It is certainly an optimization worthy of investigation in the future not to copy the Erlang release to all instances, but to expect images to already contain it, or to let them acquire it in some other way, e.g. by using a configuration management tool like Chef or Puppet, or by distributing it between each other using a peer-to-peer protocol like BitTorrent. However, this experiment was focusing on deploying Sim-Diasca using a method that is as close to the usual WombatOAM deployments as possible; that's why we chose to deploy the Erlang release.

This Erlang release is deployed in the usual WombatOAM way: the providers are registered, the release is uploaded, the node family is created, and the nodes are deployed, as detailed in D4.3 [REL14b]. The only difference is that the user should not ask WombatOAM to start the node; instead, the node should be started by calling the wo_orch_simdiasca:start_computing_nodes/2 function. The reason is that Sim-Diasca needs the nodes to have a certain name, and that the name of the simulation should be passed to the script that starts a computing node. As an example of node naming: if the simulation called soda_benchmarking_test is executed on host 10.0.0.1 with the user myuser, then the name of the node should be Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1.
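The naming rule can be sketched as follows. This is a hypothetical helper written for illustration only; the exact formatting is defined by Sim-Diasca's own start script.

```python
def computing_node_name(simulation_name, user, host):
    """Sketch of the node-naming rule described above: the simulation name
    is title-cased, underscore-joined, and combined with the user and host."""
    pretty = "_".join(w.capitalize() for w in simulation_name.split("_"))
    return "Sim-Diasca_%s-%s@%s" % (pretty, user, host)

print(computing_node_name("soda_benchmarking_test", "myuser", "10.0.0.1"))
# Sim-Diasca_Soda_Benchmarking_Test-myuser@10.0.0.1
```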

To call this function, the user can first attach to WombatOAM's Erlang node, which will provide them with an Erlang shell:

$ rel/wombat/wombat/bin/wombat attach
>

Within the Erlang shell, the following function call starts all nodes in the simdiasca_comp node family, informing them in the process that they will execute the simulation soda_benchmarking_test:

> wo_orch_simdiasca:start_computing_nodes(
>     simdiasca_comp,           % Node family of the computing nodes
>     soda_benchmarking_test).  % Simulation name

The function also calculates the node names expected by Sim-Diasca and makes sure that the nodes are started with those names.

Two new WombatOAM features eliminate the need for the start_computing_nodes function. Since performing the Sim-Diasca–WombatOAM integration work detailed here, we have implemented node name base templates, which means that users can specify the base part of the names of the nodes when deploying them. (The base part is the node name without the hostname or host address.) This eliminates the need for renaming the nodes. We have also implemented deployment hooks, with which the node can be started with the desired parameter, which means that we no longer need to rely on this function to supply the start script with the simulation name.


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the computing nodes, a function is called on the WombatOAM node to start the user node:

> wo_orch_simdiasca:start_user_nodes(
>     simdiasca_user,                         % Node family of the user node
>     soda_benchmarking_test,                 % Simulation name
>     ["10.0.0.1", "10.0.0.2", "10.0.0.3"]).  % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. The functionality of this function could also be moved into a deployment hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.
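The membership rule can be sketched over an explicit tree as follows. This is a hypothetical helper for illustration only; the actual design would use SD Erlang's s_group API, and the (leader, members) pairs merely name which group forms around which manager.

```python
def s_groups_for(manager, parent_of, children_of):
    """Sketch of the grouping rule above: a time manager joins the s_group
    formed around its parent (the parent plus its children, i.e. the manager
    and its siblings) and, if it has children, the s_group formed around
    itself (the manager plus its children)."""
    groups = []
    p = parent_of.get(manager)
    if p is not None:
        groups.append((p, tuple(children_of[p])))  # parent-and-siblings group
    if children_of.get(manager):
        groups.append((manager, tuple(children_of[manager])))  # own children group
    return groups
```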


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, the loading of the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

Improving our knowledge of these applications and the scalability issues they experience allowed us to prepare for the removal of the next bottlenecks to be encountered, and to promote design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized for either computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.
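The spin-loop pattern can be illustrated in miniature. This is a Python sketch of the workaround's logic, not the actual C code in the port: a non-blocking read is simply retried until it succeeds, rather than letting the call block.

```python
import os

def spin_read(fd, count):
    """Retry a non-blocking read in a spin loop instead of blocking,
    mirroring the workaround used for read()/write() on CNK."""
    while True:
        try:
            return os.read(fd, count)
        except BlockingIOError:
            continue  # descriptor not ready yet; spin and retry
```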

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus, the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, first some understanding of the networking layer of the Erlang runtime system is required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi instead, one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved to be desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.
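The name construction can be sketched as follows. The helper is hypothetical, and plain concatenation is assumed, since the text describes the name only as basename ++ MPI index ++ hostname without showing a separator.

```python
def mpi_node_name(index, hostname, basename="mpinode"):
    """Sketch of the name built by startup: basename ++ MPI index ++ hostname.
    Plain concatenation is an assumption; the real module may differ."""
    return "%s%d%s" % (basename, index, hostname)

print(mpi_node_name(3, "cn17"))  # prints mpinode3cn17
```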

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened, and only initializes the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to an acceptor-mode remote port. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.
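The command-byte dispatch at the heart of output can be sketched in miniature. The command values and handler shape here are hypothetical, chosen only to illustrate the parse-and-relay structure; the real driver defines its own encoding in C.

```python
# Hypothetical command byte values; the real driver defines its own.
LISTEN, ACCEPT, CONNECT, SEND, RECEIVE = range(5)

def dispatch(buf, handlers):
    """Sketch of the driver's output entry point: parse one command byte
    from the buffer and relay the remaining payload to the selected handler."""
    command, payload = buf[0], buf[1:]
    return handlers[command](payload)
```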

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as-yet-inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰ For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.

ICT-287510 (RELEASE) 23rd December 2015 59

Figure 46: EDF Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes, to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000); series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants, 1–1000; series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants, 1–100000; series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the OTP 17 versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. [Plot of execution time (s) against number of ants, 1–1000; series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions. [Plot of execution time (s) against number of ants, 1–100000; series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot of execution time (s) against number of ants, 1–1000; series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot of execution time (s) against number of ants, 1–100000; series: R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), OTP 17.4 (RELEASE).]

Change Log

Version  Date        Comments
0.1      31.01.2015  First version, submitted to internal reviewers
0.2      23.03.2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27.03.2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R. F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99): Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N. R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

• Executive Summary
• The main case study
  • Sim-Diasca Overview
  • City Example
    • Overview of the simulation case
    • Description of the simulated elements
    • Additional changes done for benchmarking
• Benchmarks
  • Orbit
    • Running Orbit on Athos
    • Distributed Erlang Orbit
    • SD Erlang Orbit
    • Experimental Evaluation
    • Results on Other Architectures
  • Ant Colony Optimisation (ACO)
    • ACO and SMTWTP
    • Multi-colony approaches
    • Evaluating Scalability
    • Experimental Evaluation
      • Performance comparison of different ACO and Erlang versions on the Athos cluster
      • Basic results
      • Increasing the number of messages
      • Some problematic results
      • Network Traffic
    • Summary
• Measurements
  • Distributed Scalability
    • Performance
    • Distributed Performance Analysis
    • Discussion
  • BenchErl
  • Percept2
    • Experiments
• Deploying Sim-Diasca with WombatOAM
  • The design of the implemented solution
  • Deployment steps
• SD Erlang Integration
• Implications and Future Work
• Porting Erlang/OTP to the Blue Gene/Q
  • Basing Erlang/OTP's Distribution Mechanism on MPI
  • MPI Driver Internals
  • Current Status of the Blue Gene/Q Port
• Single-machine ACO performance on various architectures and Erlang/OTP releases
  • Experimental parameters
  • Discussion of results
    • EDF Xeon machines
    • Glasgow Xeon machines
    • AMD machines
  • Discussion


Deploying Sim-Diasca user nodes. When the computing nodes are ready, the user needs to deploy a user node to execute the simulation on them. This involves uploading a Sim-Diasca Erlang release to WombatOAM, defining a node family for the user node, and deploying (but not starting) the user node.

Finally, as in the case of the compute nodes, a function is called on the WombatOAM node to start the user node:

    > wo_orch_simdiasca:start_user_nodes(
          simdiasca_user,                        % Node family of the user nodes
          soda_benchmarking_test,                % Simulation name
          ["10.0.0.1", "10.0.0.2", "10.0.0.3"]). % Host names of the computing nodes

WombatOAM will start the Sim-Diasca user node with the simulation name as a parameter, and generate a config file with the host names. The user node calculates the node names from the host names and the simulation name. Further automation would be possible by allowing the caller to pass the node family of the computing nodes instead of the host names, and letting WombatOAM calculate the list of host names. By using a deployment hook, this functionality could be moved into a function registered as such a hook, which means that the node could be started the usual way (from WombatOAM's REST API or web dashboard).

Starting the user node will automatically start the execution of the simulation. Logs of running the simulation will be placed in $HOME/simdiasca.txt, and the result of the simulation will be stored in the simdiasca installation directory on the user node. The user node stops after executing the simulation. (Its virtual machine instance is not terminated.)

5.2 SD Erlang Integration

A Sim-Diasca simulation proceeds in discrete time units, and hence the Time Manager simulation services are one of the main components that impact the scalability of Sim-Diasca. A promising strategy for reducing the connectivity of large Sim-Diasca simulations is to introduce s_groups to partition the tree of time managers. We present and critique a possible design for reducing connectivity, but have not implemented or evaluated this approach, as we have not yet reached a point where connectivity limits the scalability of Sim-Diasca instances (Section 4.1.3).

Time Managers schedule the events in a simulation and ensure uniform simulation time across all actors participating in the simulation. The Time Manager service is based on a single root time manager per simulation, and on exactly as many local, i.e. non-root, time managers as there are remaining compute nodes. There is no time manager on the user node. As depicted in Figure 44, the time managers form a tree, where a time manager may be the parent of one or more other time managers. The default height of the tree is one, comprising one root and n non-root time managers.

Time managers have localised connections, as they communicate only with their parent, their children, and any local actors. However, individual actors have the potential to break this localisation, as any actor may communicate with any other actor, and hence with any other node. We would need to experiment to discover the extent to which node connections and the associated communication patterns can be effectively localised.

Figure 45 presents a preliminary SD Erlang design for grouping the time manager processes in Sim-Diasca. The key idea is to create a hierarchy of Time Manager s_groups. This s_group connection topology, and the associated message routing between s_groups, is analogous to that of Multi-level ACO, which is described and shown to be effective for providing scalable reliability in Section 3.2.2.

To be more specific, a Time Manager belongs to two s_groups: the s_group of its parent and siblings, and also an s_group containing its children. This approach reduces the connections and shared namespaces between nodes. Each Time Manager s_group would provide a gateway with processes that route messages to other s_groups.


Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a Hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data corresponding to the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared for the removal of the next bottlenecks to be encountered, and promoted some design patterns and good practices to enforce regarding scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node; thus, the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us, however, first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl, and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name, basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go to acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.

• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked by a connection being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. This is currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.


R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 48 EDF Xeon machines small executions erts +Muacul0 flag set

ICT-287510 (RELEASE) 23rd December 2015 61

020

4060

80

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 49 EDF Xeon machines large executions erts +Muacul0 flag set

R15B (released in December 2011 just after the start of the RELEASE project) R16B03 OTP 170OTP 174 (the most recent official version at the time of writing in early 2015) and a version based onOTP 174 but including modifications from the RELEASE project

B2 Discussion of results

B21 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines It is clear that the 17 versionsuniformly run more slowly than R15B and R163 Closer analysis of the data shows that the ratio ofexecution times is roughly constant with OTP 174 taking about 126 longer than R15B

Having seen these results our colleagues at Ericsson suggested that some new default settings forthe Erlang runtime system in the OTP-17 versions might be affecting the VMrsquos performance and rec-ommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaultsWe tried this and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versionsexcept R15B) Visually these graphs are essentially indistinguishable from Figures 46 and 47 Analysisof the numerical data shows that setting +Muacul0 does in fact improve performance slightly but onlyby about 07 for OTP 174 This clearly doesnrsquot explain the gap between R15B and OTP174

B22 Glasgow Xeon machines

To confirm these results we re-ran these experiments on Xeon machines at Glasgow The results areshown in Figures 50 and 51 and are very similar in form to the results from the EDF machines exceptthat the discrepancy is now about 15 on average

B23 AMD machines

We reported these results to our colleagues at Ericsson but they were unable to reproduce them Theyran our small experiments on an 8-core AMD machine and obtained the results in Figure 52 it is a

ICT-287510 (RELEASE) 23rd December 2015 62

00

02

04

06

08

10

12

14

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 50 Glasgow Xeon machines small executions

020

4060

8010

012

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 51 Glasgow Xeon machines large executions

ICT-287510 (RELEASE) 23rd December 2015 63

little difficult to see the details but the upper line is the result for R15B03 whereas the others are forR16B03 OTP 170 and OTP 174 the latter three being practically indistinguishable

To confirm this we ran our full set of experiments on a 48-core AMD machine at Heriot-WattUniversity The results are shown in Figures 53 and 54 The results here are somewhat more irregularprobably due to the fact that we did not have exclusive access to the machine and other users wererunning occasional small jobs However our results are similar to Ericssonrsquos R16B03 OTP 170 andOTP 174 have similar performance and all are considerably faster than R15B (R15B takes about 9longer than R16B03 and OTP 170 and about 6 longer than the official OTP 174 release) TheRELEASE version performs badly on the AMD machine taking about 8 longer than R15B and 15longer than the official OTP 174 version This contrasts strongly with the results for the EDF Xeonmachines where the official and RELEASE versions of OTP 174 have very similar performance (infact the RELEASE version is about 05 faster than the official version) and both are about 13slower than R15B

B3 Discussion

It thus seems that recent versions of ErlangOTP perform well on AMD architectures but compara-tively badly on Xeon architectures We have as yet been unable to determine the cause of this Onemight suspect that the Xeon machinesrsquo hyperthreading might be responsible but this seems not to bethe case We ran our experiments with the Erlang VM restricted to run only on the even-numberedCPUs (which would mean that only one CPU per core was being used) but still saw the same effect

We have also been unable to explain the bad performance of the RELEASE version in comparisonwith OTP 174 (on which it is based) on the AMD machine The main changes in the RELEASE versionare in the distribution system (to support s groups) and in the addition of DTrace probes to facilitatemonitoring Since our results were obtained using a single-machine version of the ACO program whichmade no use of the distribution system we suspect that the DTrace probes are responsible

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52 Ericsson AMD machine small executions

00

02

04

06

08

10

12

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 53 Heriot-Watt AMD machine small executions

ICT-287510 (RELEASE) 23rd December 2015 65

020

4060

8010

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 54 Heriot-Watt AMD machine large executions

Change Log

Version Date Comments

01 31012015 First Version Submitted to Internal Reviewers

02 23032015 Revised version based on comments from all internal reviewers submittedto the Commission Services

10 27032015 Final version submitted to the Commission Services



ICT-287510 (RELEASE) 23rd December 2015 54

Figure 44: The Hierarchical Structure of Sim-Diasca Time Managers

Figure 45: SD Erlang version of Sim-Diasca with a hierarchy of Time Manager s_groups


6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements over only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared the ground for removing the next bottlenecks and have identified design patterns and good practices to follow for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from IBM's Blue Gene architecture series. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to Erlang/OTP's code base that were required for porting the system to the front end nodes of the Blue Gene/Q [SKN+13, Section 5.3 and Appendix]. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which must also be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet to connect multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, a working network layer is necessary. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork() it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the IP address of their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can access the TCP/IP stack simultaneously, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be exploited by the Erlang Virtual Machine. As outlined above, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us first see, however, how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver instead, for example one called mpi, one needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes that we explicitly send messages to, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.
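In Erlang/OTP, a -proto_dist mpi flag makes the net kernel look for a module named mpi_dist that exports the standard distribution callbacks (the same interface that inet_tcp_dist implements for TCP/IP). A skeleton of such a module might look as follows; this is a sketch based on OTP's documented distribution-module convention, not the actual RELEASE sources, and the delegation to mpi_server is illustrative:

```erlang
-module(mpi_dist).

%% Callbacks that net_kernel expects from a distribution module,
%% mirroring the interface of inet_tcp_dist.
-export([listen/1, accept/1, accept_connection/5,
         setup/5, close/1, select/1]).

%% Any node name is reachable over MPI, so always accept it.
select(_Node) -> true.

%% Delegate the actual work to the MPI port program via mpi_server
%% (illustrative routing; the real module split is not documented here).
listen(Name) -> mpi_server:listen(Name).
accept(Listen) -> mpi_server:accept(Listen).
accept_connection(AcceptPid, Socket, MyNode, Allowed, SetupTime) ->
    mpi_server:accept_connection(AcceptPid, Socket, MyNode, Allowed, SetupTime).
setup(Node, Type, MyNode, LongOrShortNames, SetupTime) ->
    mpi_server:setup(Node, Type, MyNode, LongOrShortNames, SetupTime).
close(Socket) -> mpi_server:close(Socket).
```

The naming convention (flag value plus a _dist suffix) is what ties the erl command line above to this module.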

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpi_helper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0 and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpi_helper:nodes(). Additionally, mpi_helper:get_world_size() returns the total number of nodes, and mpi_helper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
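Putting the pieces together, bringing up an MPI-backed node might look like this (an illustrative sketch; only the mpi_helper function names are taken from the description above, and the return value of startup/0 is assumed):

```erlang
%% Illustrative start-up of one MPI-backed distributed Erlang node.
%% Assumes the VM was started as:
%%   erl -no_epmd -connect_all false -proto_dist mpi
start() ->
    ok = mpi_helper:startup(),            %% default base name, mpinode
    Index = mpi_helper:get_index(),       %% this node's MPI index
    Total = mpi_helper:get_world_size(),  %% total number of Erlang nodes
    Peers = mpi_helper:nodes(),           %% all other, now-connected nodes
    io:format("node ~p of ~p; peers: ~p~n", [Index, Total, Peers]).
```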

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init: This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpi_helper module as mpi_helper:mpi_startup().

start: This function is called for every new port opened and only initializes the port's data structures.

stop: This function is called whenever a port is closed. It is currently not supported, as MPI connections do not need to be closed.

output: This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the corresponding functionality. The available commands are listen, accept, connect, send and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on its first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it waits to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input: Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output: Transmits all data currently buffered in the port.

finish: Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular so that tick messages are generated when there has been no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
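On the Erlang side, control-mode commands such as these are typically issued through erlang:port_control/3. A hedged sketch, with hypothetical command numbers (the text does not give the driver's actual encoding):

```erlang
%% Hypothetical command numbers for the driver's control entry point;
%% the real values used by the mpi driver are not documented here.
-define(CTL_STATISTICS,   1).
-define(CTL_TICK_MESSAGE, 2).

%% Ask the driver whether the port has seen recent traffic.
statistics(Port) ->
    erlang:port_control(Port, ?CTL_STATISTICS, []).

%% Trigger a tick message, e.g. when the net kernel wants a heartbeat.
tick(Port) ->
    erlang:port_control(Port, ?CTL_TICK_MESSAGE, []).
```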

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP [10] and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

[10] For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Figure 46: EDF Xeon machines, small executions. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

• Machines in EDF's ATHOS cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64 GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64 GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32 GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512 GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000 ants

• Large: 1, 500, 1000, 1500, …, 100000 ants

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:
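The measurement procedure can be sketched as follows (illustrative only: ant_colony:run/3 is a hypothetical entry point standing in for the single-machine ACO program, whose real name and arguments are not given here):

```erlang
%% Mean wall-clock time, in seconds, over Runs executions of the ACO
%% program with NumAnts ants (input size 40, 50 generations).
%% timer:tc/3 returns {Microseconds, Result}; we keep the time only.
mean_runtime(NumAnts, Runs) ->
    Micros = [element(1, timer:tc(ant_colony, run, [40, 50, NumAnts]))
              || _ <- lists:seq(1, Runs)],
    lists:sum(Micros) / (Runs * 1000000).
```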


[Figure 47: EDF Xeon machines, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


[Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings of the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to restore the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran the experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to those from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


[Figure 50: Glasgow Xeon machines, small executions. Execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

[Figure 51: Glasgow Xeon machines, large executions. Execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B3 Discussion

It thus seems that recent versions of ErlangOTP perform well on AMD architectures but compara-tively badly on Xeon architectures We have as yet been unable to determine the cause of this Onemight suspect that the Xeon machinesrsquo hyperthreading might be responsible but this seems not to bethe case We ran our experiments with the Erlang VM restricted to run only on the even-numberedCPUs (which would mean that only one CPU per core was being used) but still saw the same effect

We have also been unable to explain the bad performance of the RELEASE version in comparisonwith OTP 174 (on which it is based) on the AMD machine The main changes in the RELEASE versionare in the distribution system (to support s groups) and in the addition of DTrace probes to facilitatemonitoring Since our results were obtained using a single-machine version of the ACO program whichmade no use of the distribution system we suspect that the DTrace probes are responsible

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52 Ericsson AMD machine small executions

00

02

04

06

08

10

12

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 53 Heriot-Watt AMD machine small executions

ICT-287510 (RELEASE) 23rd December 2015 65

020

4060

8010

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 54 Heriot-Watt AMD machine large executions

Change Log

Version Date Comments

01 31012015 First Version Submitted to Internal Reviewers

02 23032015 Revised version based on comments from all internal reviewers submittedto the Commission Services

10 27032015 Final version submitted to the Commission Services

References

[APR+12] Stavros Aronis Nikolaos Papaspyrou Katerina Roukounaki Konstantinos Sagonas Yian-nis Tsiouris and Ioannis E Venetis A scalability benchmark suite for ErlangOTP InProceedings of the Eleventh ACM SIGPLAN Workshop on Erlang pages 33ndash42 ACM 2012

[Bas14] Basho Riak 2014

[BBHS99] A Bauer B Bullnheimer RF Hartl and C Strauss An ant colony optimization approachfor the single machine total tardiness problem In Evolutionary Computation 1999 CEC99 Proceedings of the 1999 Congress on volume 2 1999

[CLTG14] N Chechina H Li P Trinder and A Ghaffari Scalable SD Erlang computation modelTechnical Report TR-2014-003 The University of Glasgow December 2014

[dBSD00] Matthijs den Besten Thomas Stutzle and Marco Dorigo Ant colony optimization for thetotal weighted tardiness problem In Marc Schoenauer Kalyanmoy Deb Gunther RudolphXin Yao Evelyne Lutton JuanJulian Merelo and Hans-Paul Schwefel editors ParallelProblem Solving from Nature PPSN VI volume 1917 of Lecture Notes in Computer Sciencepages 611ndash620 Springer Berlin Heidelberg 2000

ICT-287510 (RELEASE) 23rd December 2015 66

[Del13] Pierre Delisle Parallel Ant Colony Optimization Algorithmic Models and Hardware Imple-mentations pages 45ndash62 Intech 2013

[DS04] Marco Dorigo and Thomas Stutzle Ant Colony Optimization Bradford Company ScituateMA USA 2004

[GCTM13] Amir Ghaffari Natalia Chechina Phil Trinder and Jon Meredith Scalable persistentstorage for Erlang Theory and practice In Proceedings of the Twelfth ACM SIGPLANWorkshop on Erlang Erlang rsquo13 pages 73ndash74 New York NY USA 2013 ACM

[GPG15] GPG Cluster 2015 httpwwwdcsglaacukresearchgpgclusterhtm

[Hof10] Todd Hoff Netflix Continually Test by Failing Servers with Chaos Mon-key httphighscalabilitycomblog20101228netflix-continually-test-by-failing-servers-with-chaos-monkehtml December 2010

[IB13] Sorin Ilie and Costin Badica Multi-agent approach to distributed ant colony optimizationScience of Computer Programming 78(6)762ndash774 2013

[KYSO00] H Kawamura M Yamamoto K Suzuki and A Ohuchi Multiple ant colonies algorithmbased on colony level interactions IEICE Transactions on Fundamentals of ElectronicsCommunications and Computer Sciences E83-A(2)371ndash379 2000

[LN01] Frank Lubeck and Max Neunhoffer Enumerating large Orbits and direct condensationExperimental Mathematics 10(2)197ndash205 2001

[Lun12] Daniel Luna Chaos Monkey httpsgithubcomdLunachaos_monkey 2012

[MC98] Jeff Matocha and Tracy Camp A taxonomy of distributed termination detection algorithmsJournal of Systems and Software 43(221)207ndash221 1998

[McN59] Robert McNaughton Scheduling with deadlines and loss functions Management Science6(1)1ndash12 1959

[MM00] Daniel Merkle and Martin Middendorf An ant algorithm with a new pheromone evaluationrule for total tardiness problems In Proceedings of EvoWorkshops 2000 volume 1803 ofLNCS pages 287ndash296 Springer Verlag 2000

[MRS02] Martin Middendorf Frank Reischle and Hartmut Schmeck Multi colony ant algorithmsJournal of Heuristics 8(3)305ndash320 2002

[PNC11] Martin Pedemonte Sergio Nesmachnow and Hector Cancela A survey on parallel ant colonyoptimization Appl Soft Comput 11(8)5181ndash5197 2011

[PVW91] C N Potts and L N Van Wassenhove Single machine tardiness sequencing heuristics IIETransactions 23(4)346ndash354 1991

[REL14a] RELEASE Project Deliverable D34 Scalable Reliable OTP Library Release September2014

[REL14b] RELEASE Project Deliverable D43 Heterogeneous Super-cluster Infrastructure July 2014

[REL15] RELEASE Project Deliverable D35 Performance Portability Principles February 2015

ICT-287510 (RELEASE) 23rd December 2015 67

[RV09] NR Srinivasa Raghavan and M Venkataramana Parallel processor scheduling for minimiz-ing total weighted tardiness using ant colony optimization The International Journal ofAdvanced Manufacturing Technology 41(9-10)986ndash996 2009

[SKN+13] Konstantinos Sagonas David Klaftenegger Patrik Nyblom Nikolaos Papaspyrou KaterinaRoukounaki and Kjell Winblad Deliverable D22 Prototype Scalable Erlang VM ReleaseApril 2013

[Sol14] Erlang Solutions WombatOAM-enabled Sim-Diasca httpsgithubcomrelease-projectsimdiasca 2014

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting Erlang/OTP to the Blue Gene/Q
                                            • Basing Erlang/OTP's Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue Gene/Q Port
                                              • Single-machine ACO performance on various architectures and Erlang/OTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion

ICT-287510 (RELEASE) 23rd December 2015 55

6 Implications and Future Work

In this deliverable we studied the scaling properties of a substantial case study, the Sim-Diasca City instance, and two benchmarks, ACO and Orbit. The larger and more complex a program is, the more involved its benchmarking becomes: scalability constraints impact the measurement and profiling tools as well, and managing the non-central parts of an application becomes increasingly difficult. For example, loading the initial data for the larger scales of the City-example case required a few days of computation to complete, making the benchmarking unwieldy at best. Gathering runtime metrics is a major issue at these scales, with a full profile containing gigabytes of data and requiring a long time to analyse. We addressed this problem by making measurements for only a short interval, as in Section 4.3 for example.

By improving our knowledge of these applications and the scalability issues they experience, we have prepared the ground for removing the next bottlenecks to be encountered, and have promoted design patterns and good practices for scalability. Language extensions like SD Erlang offer good hope that the next generation of the software studied in this deliverable will be able to take advantage of upcoming large-scale massive infrastructures, similar to the ones we have been able to benchmark in RELEASE.


A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from the Blue Gene computer architecture series by IBM. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes [SKN+13, Section 5.3 and Appendix] of the Blue Gene/Q. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which also must be accessed using spin loops.

For running Erlang programs on just the front end nodes this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork() it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address as their associated I/O node, so the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is required first.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that implement the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us however first see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from


Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively). To use a different driver, for example one called mpi, instead one needs to start the Erlang node

with the command erl -no_epmd -connect_all false -proto_dist mpi. This will disable epmd, only connect to nodes that we explicitly send messages to, and activate the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.
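As a sketch, the two startup paths described above look as follows (the node name mynodename is illustrative):

```erlang
%% Conventional TCP/IP-based distribution; assumes epmd can be spawned:
%%   $ erl -sname mynodename
%% which is equivalent to starting a plain node and then calling:
net_kernel:start([mynodename, shortnames]).

%% MPI-based distribution for the Blue Gene/Q compute nodes; epmd is
%% disabled and nodes connect only when explicitly contacted:
%%   $ erl -no_epmd -connect_all false -proto_dist mpi
%% followed, inside the node, by the same call:
net_kernel:start([mynodename, shortnames]).
```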

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.
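A hypothetical session using this module might look as follows; the start_job/0 wrapper and the printed output are illustrative, not part of the deliverable:

```erlang
%% Sketch: initialize MPI-based distribution on each Erlang node of the
%% job, then inspect the resulting topology. Error handling is omitted.
start_job() ->
    mpihelper:startup(),                 % node becomes e.g. mpinode3@<host>
    Size  = mpihelper:get_world_size(),  % total number of Erlang nodes
    Index = mpihelper:get_index(),       % this node's unique MPI index
    Peers = mpihelper:nodes(),           % all other nodes, now connected
    io:format("node ~p of ~p, peers: ~p~n", [Index, Size, Peers]).
```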

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on this first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it waits to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.


• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions respectively send or receive data.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to trigger tick messages when there was no recent traffic.

• The command-mode, intermediate-mode, and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement for TCP/IP as the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 46: EDF Xeon machines, small executions

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded with two processing units per core

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
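The parameter sweep can be sketched as follows; the aco:run/3 entry point and its argument order are assumptions for illustration, not the actual benchmark API:

```erlang
%% Sketch: run the ACO benchmark Reps times for a given number of ants
%% (input size 40, 50 generations of ants) and return the mean
%% wall-clock time in seconds.
mean_time(Ants, Reps) ->
    Times = [element(1, timer:tc(aco, run, [40, 50, Ants])) / 1.0e6
             || _ <- lists:seq(1, Reps)],
    lists:sum(Times) / Reps.

%% Small series: 1, 10, 20, 30, ..., 1000 ants, averaged over 5 runs each.
small_series() ->
    [{Ants, mean_time(Ants, 5)} || Ants <- [1 | lists:seq(10, 1000, 10)]].
```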

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 47: EDF Xeon machines, large executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the OTP 17 versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52. It is a


[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 50: Glasgow Xeon machines, small executions

[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

[Plot: execution time (s) against number of ants (1, 10, 20, 30, ..., 1000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 53: Heriot-Watt AMD machine, small executions


[Plot: execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000), for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version Date Comments

0.1 31.01.2015 First version. Submitted to internal reviewers.

0.2 23.03.2015 Revised version based on comments from all internal reviewers, submitted to the Commission Services.

1.0 27.03.2015 Final version, submitted to the Commission Services.

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature: PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting ErlangOTP to the Blue GeneQ
                                            • Basing ErlangOTPs Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue GeneQ Port
                                              • Single-machine ACO performance on various architectures and ErlangOTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion
Page 58: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

ICT-287510 (RELEASE) 23rd December 2015 56

A Porting Erlang/OTP to the Blue Gene/Q

The Blue Gene/Q is a massively parallel computer system from IBM's Blue Gene architecture series. It is divided into racks, in which nodes are specialized either for computation (the so-called compute nodes) or for handling I/O (I/O nodes). Additional hardware is used for the storage subsystem and certain specialized nodes: the service nodes and the front end nodes. The front end nodes are interactive login nodes for system users; only from these nodes can a user get access to the rest of the Blue Gene/Q system. The front end nodes also provide the necessary tools (e.g. compilers) and allow access to the job control system.

Deliverable D2.2 of RELEASE described the changes to the Erlang/OTP code base that were required for porting the system to the front end nodes of the Blue Gene/Q [SKN+13, Section 5.3 and Appendix]. It also outlined the challenges of porting the system to the compute nodes of the machine [SKN+13, Section 5.2].

The port of Erlang/OTP to the front end nodes of the machine required replacing numerous POSIX system calls with workarounds to stay within the limitations of Blue Gene/Q's CNK (Compute Node Kernel) operating system. Of these changes, the workaround with the most impact is that every read() and write() system call needs to be executed in a spin loop to avoid blocking, as CNK would otherwise deadlock under many circumstances. Similarly, pipe() system calls have been replaced with manually connected pairs of sockets, which must also be accessed using spin loops.

For running Erlang programs on just the front end nodes, this port is fully functional; in fact, even distributed Erlang can be run on these nodes. Erlang nodes can also be started on the compute nodes of the machine. Yet, for connecting multiple Erlang nodes running on separate compute nodes of the Blue Gene/Q system, it is necessary to have a working network layer. On CNK, TCP/IP breaks several assumptions that are valid on POSIX operating systems. Firstly, due to the absence of fork(), it is not possible to spawn epmd (the Erlang Port Mapper Daemon) to arbitrate which TCP port each Erlang node should use. Furthermore, on CNK multiple compute nodes share the same IP address, that of their associated I/O node; thus the port cannot be hard-coded either. Finally, TCP/IP communication suffers from severe performance and correctness issues on CNK when used for internal communication: only one compute node per I/O node can simultaneously access the TCP/IP stack, making deadlocks a severe problem. We therefore deemed it necessary to use another communication layer, one which does not need the I/O nodes to be involved, namely MPI. For this to be properly explained, some understanding of the networking layer of the Erlang runtime system is first required.

A.1 Basing Erlang/OTP's Distribution Mechanism on MPI

A distributed Erlang system consists of multiple Erlang nodes, which may be located on different hosts (computer nodes). In the context of scaling distributed Erlang systems, using multiple Erlang nodes per host is usually irrelevant, as the SMP/NUMA capabilities of each host can already be used by the Erlang Virtual Machine. As outlined before, the normal TCP/IP-based network layer in the Erlang runtime system is not usable on the Blue Gene/Q CNK. To provide distributed Erlang with a different networking back-end, the Erlang runtime uses the concept of drivers, which abstract away the actual communication and instead provide a table of function calls that enable the required functionality.

We developed such a driver based on MPI, whose implementation we outline in Section A.2 below. Let us first, however, see how this driver can be used to replace Erlang/OTP's distribution mechanism, which is based on TCP/IP, and the additional helper module that its use requires.

Starting the Erlang Distribution. Normally, a distributed Erlang node is started using the command erl -sname mynodename (or erl -name mynodename). Such a command is equivalent to starting the Erlang node with just erl and then transforming it into a distributed node by calling, from

ICT-287510 (RELEASE) 23rd December 2015 57

Erlang, net_kernel:start([mynodename, shortnames]) (or longnames, respectively).

To use a different driver, for example one called mpi, one instead needs to start the Erlang node with the command erl -no_epmd -connect_all false -proto_dist mpi. This disables epmd, connects only to nodes that we explicitly send messages to, and activates the mpi_dist Erlang module as the network driver. After starting, net_kernel:start([mynodename, shortnames]) will bring up the networking layer as expected.

However, to run this MPI-based networking back-end on a cluster like Athos, one also needs to know the names of the other nodes in the system, which are not available at the time the job is submitted. It also proved desirable to be able to access properties of the MPI environment before the Erlang net_kernel module is fully initialized. To this end, we developed an mpihelper module.

The MPI Helper Module. This module provides the following helper functions: startup/0, startup/1, get_index/0, get_world_size/0, and nodes/0. The startup function takes a base name (mpinode by default) and builds a name basename ++ MPI index ++ hostname, which is passed to net_kernel for initialization. It is expected that startup is called from all Erlang nodes with the same base name. To fully initialize the distributed Erlang system, it passes messages between each pair of nodes, causing the connections to be properly initialized.

Afterwards, each node can get the set of all other nodes with mpihelper:nodes(). Additionally, mpihelper:get_world_size() returns the total number of nodes, and mpihelper:get_index() returns the unique MPI index number of the node, which is also part of its node name.

A.2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi_dist. It uses the two modules mpi_server (for gen_server callbacks) and mpi (for interfacing with the port program written in C). The MPI driver C code implements the following functions:

init This function is called when the driver is initialized. It sets up the MPI environment and initializes the required data structures. This functionality is also exposed through the mpihelper module as mpihelper:mpi_startup().

start This function is called for every new port opened, and only initializes the port's data structures.

stop This function is called whenever a port is closed; it is currently not supported, as MPI connections do not need to be closed.

output This function is called when the port is in command mode and the Erlang node wants to write something to it. It works by parsing a command byte from the output and relaying the remainder to the specified functionality. The available commands are listen, accept, connect, send, and receive. (Additionally, there is a possibility to do a passive send.) Briefly:

• A listen call initializes a port to listening mode. As the first call to listen only happens right after the MPI environment and the net_kernel are fully initialized, it broadcasts the names of all Erlang nodes on the first call. This makes it possible to reverse-map Erlang node names to MPI index numbers.

• An accept call makes the port go into acceptor mode, where it will wait to be contacted by another Erlang node. There is always at most one port in acceptor mode. When accept is called, a thread is spawned to handle the next incoming connection.

ICT-287510 (RELEASE) 23rd December 2015 58

• Using connect, an Erlang node can connect to a remote port in acceptor mode. A thread is spawned to handle the communication, so that the Erlang runtime is not blocked while a connection is being established.

• The send and receive functions send or receive data, respectively.

ready_input Receives some data, or tells the runtime to call this function again once it believes there to be data again.

ready_output Transmits all data currently buffered in the port.

finish Called when the driver is unloaded. Currently not supported, due to limitations of MPI and its implementations.

control Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port, and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, in particular to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states of the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement of TCP/IP for the networking layer of Erlang/OTP,10 and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as-yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

10 For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (official) and OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron Processor 4376 HE, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000

• Large: 1, 500, 1000, 1500, …, 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for the same five Erlang/OTP versions.]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000).]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to restore the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly does not explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51 and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52; it is a


Figure 50: Glasgow Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for the same five Erlang/OTP versions.]

Figure 51: Glasgow Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one processing unit per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for the same five Erlang/OTP versions.]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000).]

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günter Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpg/cluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(3):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer-Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martin Pedemonte, Sergio Nesmachnow and Hector Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting ErlangOTP to the Blue GeneQ
                                            • Basing ErlangOTPs Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue GeneQ Port
                                              • Single-machine ACO performance on various architectures and ErlangOTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion
Page 59: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

ICT-287510 (RELEASE) 23rd December 2015 57

Erlang net kernelstart([mynodename shortnames]) (or longnames respectively)To use a different driver for example one called mpi instead one needs to start the Erlang node

with the command erl -no epmd -connect all false -proto dist mpi This will disableepmd only connect to nodes that we explicitly send messages to and activate the mpi dist Erlangmodule as the network driver After starting net kernelstart([mynodename shortnames])will bring up the networking layer as expected

However to run this MPI-based networking back-end on a cluster like Athos one also needs toknow the names of other nodes in the system which is not available at the time the job is submittedIt also proved to be desirable to be able to access properties of the MPI environment before the Erlangnet kernel module is fully initialized To this end we developed an mpihelper module

The MPI Helper Module This module provides the following helper functions startup0startup1 get index0 get world size0 and nodes0 The startup function takes a basename (mpinode by default) and builds a name basename ++ MPI index ++ hostnamewhich is passed to net kernel for initialization It is expected that startup is called from all Erlangnodes with the same base name To fully initialize the distributed Erlang system it passes messagesbetween each pair causing the connection to be properly initialized

Afterwards each node can get the set of all other nodes with mpihelpernodes() Additionally

mpihelperget world size() returns the number of nodes in total and

mpihelperget index() returns the unique MPI index number of the node which is also part ofits node name

A2 MPI Driver Internals

The main interface to the driver is the Erlang module mpi dist It uses the two modules mpi server(for gen server callbacks) and mpi (for interfacing the port program written in C) The MPI driverC code implements the following functions

init This function is called when the driver is initialized It sets up the MPI environment andinitializes required data structures This functionality is also exposed through the mpihelpermodule as mpihelpermpi startup()

start This function is called for every new port opened and only initializes the portrsquos data structures

stop This function is called whenever a port is closed and is currently not supported as MPI con-nections do not need to be closed

output This function is called when the port is in command mode and the Erlang node wants to writesomething to it It works by parsing a command byte from the output and relaying the remainderto the specified functionality The available commands are listen accept connect sendand receive (Additionally there is a possibility to do a passive send) Briefly

bull A listen call initializes a port to listening mode As the first call to listen only happensright after the MPI environment and the net kernel are fully initialized it broadcasts thenames of all Erlang nodes on first call This makes it possible to reverse-map the Erlangnode name to MPI index numbers

bull An accept call makes the port go to acceptor mode where it will wait to be contacted byanother Erlang node There is always at most one port in acceptor mode When accept iscalled a thread is spawned to handle the next incoming connection

ICT-287510 (RELEASE) 23rd December 2015 58

bull Using connect an Erlang node can connect to an acceptor mode remote port A threadis spawned to handle the communication so that the Erlang runtime is not blocked by aconnection being established

bull The send and receive functions respectively send or receive data

ready input Receives some data or tells the runtime to call this function again once it believes thereto be data again

ready output Transmits all data currently buffered in the port

finish Called when the driver is unloaded Currently not supported due to limitations of MPI andits implementations

control: Similar to output, but called when the port is in control mode. The available commands are statistics, command-mode, intermediate-mode, data-mode, tick-message, number-for-listen-port and retrieve-creation-number. In short:

• The statistics call is used to determine whether there was any communication on the port recently, especially to cause tick messages when there was no recent traffic.

• The command-mode, intermediate-mode and data-mode calls are used to switch between different states in the port.

• The number-for-listen-port and retrieve-creation-number calls always return 0.

• Tick messages are sent every time the tick-message call is triggered.
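The interplay between the statistics and tick-message calls can be modelled as below. This is a hedged Python sketch of the keep-alive logic, with counter and method names of our own invention; the real driver tracks this state in C.

```python
# Illustrative model of tick handling: statistics reports whether any
# traffic was seen since the last check, and a tick is sent only when
# there was none. All names here are assumptions for illustration.

class PortStats:
    def __init__(self):
        self.bytes_in = 0
        self.bytes_out = 0
        self._last_seen = (0, 0)
        self.ticks_sent = 0

    def record_traffic(self, received=0, sent=0):
        self.bytes_in += received
        self.bytes_out += sent

    def statistics(self):
        """Report whether there was any communication since the last call."""
        now = (self.bytes_in, self.bytes_out)
        had_traffic = now != self._last_seen
        self._last_seen = now
        return had_traffic

    def maybe_tick(self):
        """Send a tick (cf. the tick-message call) only when idle."""
        if not self.statistics():
            self.ticks_sent += 1
```

An idle port thus accumulates ticks, while a busy one does not, mirroring how net_kernel uses ticks to detect dead connections.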

A.3 Current Status of the Blue Gene/Q Port

This MPI driver provides a proper replacement for TCP/IP as the networking layer of Erlang/OTP¹⁰ and, as discussed, is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene to talk to each other. Still, the Blue Gene/Q port is incomplete: what remains is to find a replacement for dynamically loaded drivers on that platform. We are currently exploring alternatives.

B Single-machine ACO performance on various architectures and Erlang/OTP releases

As seen in Sections 3.1.4 and 3.3.1, our benchmarking results for distributed Erlang applications suggest that recent Erlang/OTP versions have poorer performance than older versions. In particular, Erlang/OTP 17.4 (released in December 2014) and the RELEASE version of Erlang/OTP, which is a modified version of Erlang/OTP 17.4, both give longer execution times than R15B (released in December 2011). Crucially, however, all three versions have similar scalability curves, modulo the runtime penalty.

In an as yet inconclusive attempt to isolate the cause of this phenomenon, we have done some tests with the non-distributed version of the ACO application. This runs on a single SMP machine and makes no use of the RELEASE project's modifications to the Erlang distribution mechanism; indeed, it makes no use of Erlang's distribution mechanism at all.

We have used several different systems:

¹⁰For example, we have been able to run the distributed version of the Orbit benchmark (Section 3.1) on the Athos cluster using MPI-based distribution.


Figure 46: EDF Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded with two processing units per core

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, ..., 1000

• Large: 1, 500, 1000, 1500, ..., 100000

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.
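The two ant-count series above can be written down explicitly; the following short sketch (ours, for reference) generates them:

```python
# The two experiment series: each starts at 1 ant, then increases in
# regular steps (10 for the small runs, 500 for the large runs).
small = [1] + list(range(10, 1001, 10))      # 1, 10, 20, 30, ..., 1000
large = [1] + list(range(500, 100001, 500))  # 1, 500, 1000, 1500, ..., 100000
```

This gives 101 small configurations and 201 large ones per Erlang/OTP release under test.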

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases:


Figure 47: EDF Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]


Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.
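The "roughly constant ratio" observation amounts to computing, at each ant count, the relative slowdown of one release against a baseline. A small sketch, using made-up example timings rather than our measured data:

```python
# Relative slowdown of one release against a baseline, as a percentage.
# The timings below are made-up illustrations, not measured data.

def slowdown_percent(t_new: float, t_baseline: float) -> float:
    return (t_new / t_baseline - 1.0) * 100.0

# A constant ratio means the slowdown is the same at every problem size:
r15b  = [10.0, 20.0, 40.0]           # hypothetical baseline times (s)
otp17 = [t * 1.126 for t in r15b]    # uniformly 12.6% slower

slowdowns = [slowdown_percent(n, b) for n, b in zip(otp17, r15b)]
```

Here every entry of `slowdowns` is (up to rounding) 12.6, which is what a parallel pair of curves on Figures 46 and 47 corresponds to.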

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52: it is a


Figure 50: Glasgow Xeon machines, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Figure 51: Glasgow Xeon machines, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably due to the fact that we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


Figure 52: Ericsson AMD machine, small executions

Figure 53: Heriot-Watt AMD machine, small executions. [Plot of execution time (s) against number of ants (1, 10, 20, 30, ..., 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]


Figure 54: Heriot-Watt AMD machine, large executions. [Plot of execution time (s) against number of ants (1, 500, 1000, 1500, ..., 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official) and OTP 17.4 (RELEASE).]

Change Log

Version | Date       | Comments
0.1     | 31/01/2015 | First version, submitted to internal reviewers
0.2     | 23/03/2015 | Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0     | 27/03/2015 | Final version, submitted to the Commission Services


  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting Erlang/OTP to the Blue Gene/Q
                                            • Basing Erlang/OTP's Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue Gene/Q Port
                                              • Single-machine ACO performance on various architectures and Erlang/OTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion
Page 60: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

ICT-287510 (RELEASE) 23rd December 2015 58

bull Using connect an Erlang node can connect to an acceptor mode remote port A threadis spawned to handle the communication so that the Erlang runtime is not blocked by aconnection being established

bull The send and receive functions respectively send or receive data

ready input Receives some data or tells the runtime to call this function again once it believes thereto be data again

ready output Transmits all data currently buffered in the port

finish Called when the driver is unloaded Currently not supported due to limitations of MPI andits implementations

control Similar to output but called when the port is in control mode The available commandsare statistics command-mode intermediate-mode data-mode tick-messagenumber-for-listen-port and retrieve-creation-number In short

bull The statistics call is used to determine whether there was any communication on theport recently especially to cause tick messages when there was no recent traffic

bull The command-mode intermediate-mode and data-mode calls are used to switchbetween different states in the port

bull The number-for-listen-port and retrieve-creation-number calls always return0

bull Tick messages are sent every time the tick-message call is triggered

A3 Current Status of the Blue GeneQ Port

This MPI driver provides a proper replacement of TCPIP for the networking layer of ErlangOTP10

and as discussed is a prerequisite for Erlang nodes on different compute nodes of the Blue Gene totalk to each other Still the Blue GeneQ port is incomplete What remains is to find a replacementfor dynamically loaded drivers on that platform We are currently exploring alternatives

B Single-machine ACO performance on various architectures andErlangOTP releases

As seen in Sections 314 and 331 our benchmarking results for distributed Erlang applications sug-gest that recent ErlangOTP versions have poorer performance than older versions In particularErlangOTP 174 (released in December 2014) and the RELEASE version of ErlangOTP which amodified version of ErlangOTP 174 both give longer execution times than R15B (released in Decem-ber 2011) Crucially however all three versions have similar scalability curves modulo the runtimepenalty

In an as yet inconclusive attempt to isolate the cause of this phenomenon we have done sometests with the non-distributed version of the ACO application This runs on a single SMP machine andmakes no use of the RELEASE projectrsquos modifications to the Erlang distribution mechanism indeedit makes no use of Erlangrsquos distribution mechanism at all

We have used several different systems

10For example we have been able to run the distributed version of the Orbit benchmark (Section 31) on the Athoscluster using MPI-based distribution

ICT-287510 (RELEASE) 23rd December 2015 59

00

02

04

06

08

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 46 EDF Xeon machines small executions

bull Machines in EDFrsquos ATHOS cluster These each have 24 Intel Xeon E5-2697 v2 processing unitsand 64GB of RAM

bull Machines in the GPG cluster at Glasgow University These each have 16 Intel Xeon E5 2640 v2processing units and 64GB of RAM

bull A machine at Ericsson with an AMD Opteron Processor 4376 HE with 8 cores and 32 GB ofRAM

bull A multicore machine at Heriot-Watt University called cantor which has an AMD Opteron 6248processor (48 cores) and 512GB of RAM

The Xeon machines are hyperthreaded with two processing units per core

B1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants On each systemwe carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant)so that we could observe the effect of the number of processes on execution time There were two sizesof experiment

bull Small 1 10 20 30 1000

bull Large 1 500 1000 1500 100000

The small experiments have a relatively low number of processes similar to those in our distributedexperiments The large experiments use large numbers of concurrent processes to make sure that theErlang VM is fully exercised

The program was run 5 times with each number of ants and our graphs show the mean executiontime over these 5 runs for each number of ants We repeated the experiments with several VM releases

ICT-287510 (RELEASE) 23rd December 2015 60

020

4060

80

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 47 EDF Xeon machines large executions

00

02

04

06

08

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 48 EDF Xeon machines small executions erts +Muacul0 flag set

ICT-287510 (RELEASE) 23rd December 2015 61

020

4060

80

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 49 EDF Xeon machines large executions erts +Muacul0 flag set

R15B (released in December 2011 just after the start of the RELEASE project) R16B03 OTP 170OTP 174 (the most recent official version at the time of writing in early 2015) and a version based onOTP 174 but including modifications from the RELEASE project

B2 Discussion of results

B21 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines It is clear that the 17 versionsuniformly run more slowly than R15B and R163 Closer analysis of the data shows that the ratio ofexecution times is roughly constant with OTP 174 taking about 126 longer than R15B

Having seen these results our colleagues at Ericsson suggested that some new default settings forthe Erlang runtime system in the OTP-17 versions might be affecting the VMrsquos performance and rec-ommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaultsWe tried this and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versionsexcept R15B) Visually these graphs are essentially indistinguishable from Figures 46 and 47 Analysisof the numerical data shows that setting +Muacul0 does in fact improve performance slightly but onlyby about 07 for OTP 174 This clearly doesnrsquot explain the gap between R15B and OTP174

B22 Glasgow Xeon machines

To confirm these results we re-ran these experiments on Xeon machines at Glasgow The results areshown in Figures 50 and 51 and are very similar in form to the results from the EDF machines exceptthat the discrepancy is now about 15 on average

B23 AMD machines

We reported these results to our colleagues at Ericsson but they were unable to reproduce them Theyran our small experiments on an 8-core AMD machine and obtained the results in Figure 52 it is a

ICT-287510 (RELEASE) 23rd December 2015 62

00

02

04

06

08

10

12

14

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 50 Glasgow Xeon machines small executions

020

4060

8010

012

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 51 Glasgow Xeon machines large executions

ICT-287510 (RELEASE) 23rd December 2015 63

little difficult to see the details but the upper line is the result for R15B03 whereas the others are forR16B03 OTP 170 and OTP 174 the latter three being practically indistinguishable

To confirm this we ran our full set of experiments on a 48-core AMD machine at Heriot-WattUniversity The results are shown in Figures 53 and 54 The results here are somewhat more irregularprobably due to the fact that we did not have exclusive access to the machine and other users wererunning occasional small jobs However our results are similar to Ericssonrsquos R16B03 OTP 170 andOTP 174 have similar performance and all are considerably faster than R15B (R15B takes about 9longer than R16B03 and OTP 170 and about 6 longer than the official OTP 174 release) TheRELEASE version performs badly on the AMD machine taking about 8 longer than R15B and 15longer than the official OTP 174 version This contrasts strongly with the results for the EDF Xeonmachines where the official and RELEASE versions of OTP 174 have very similar performance (infact the RELEASE version is about 05 faster than the official version) and both are about 13slower than R15B

B3 Discussion

It thus seems that recent versions of ErlangOTP perform well on AMD architectures but compara-tively badly on Xeon architectures We have as yet been unable to determine the cause of this Onemight suspect that the Xeon machinesrsquo hyperthreading might be responsible but this seems not to bethe case We ran our experiments with the Erlang VM restricted to run only on the even-numberedCPUs (which would mean that only one CPU per core was being used) but still saw the same effect

We have also been unable to explain the bad performance of the RELEASE version in comparisonwith OTP 174 (on which it is based) on the AMD machine The main changes in the RELEASE versionare in the distribution system (to support s groups) and in the addition of DTrace probes to facilitatemonitoring Since our results were obtained using a single-machine version of the ACO program whichmade no use of the distribution system we suspect that the DTrace probes are responsible

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52 Ericsson AMD machine small executions

00

02

04

06

08

10

12

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 53 Heriot-Watt AMD machine small executions

ICT-287510 (RELEASE) 23rd December 2015 65

020

4060

8010

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 54 Heriot-Watt AMD machine large executions

Change Log

Version Date Comments

01 31012015 First Version Submitted to Internal Reviewers

02 23032015 Revised version based on comments from all internal reviewers submittedto the Commission Services

10 27032015 Final version submitted to the Commission Services

References

[APR+12] Stavros Aronis Nikolaos Papaspyrou Katerina Roukounaki Konstantinos Sagonas Yian-nis Tsiouris and Ioannis E Venetis A scalability benchmark suite for ErlangOTP InProceedings of the Eleventh ACM SIGPLAN Workshop on Erlang pages 33ndash42 ACM 2012

[Bas14] Basho Riak 2014

[BBHS99] A Bauer B Bullnheimer RF Hartl and C Strauss An ant colony optimization approachfor the single machine total tardiness problem In Evolutionary Computation 1999 CEC99 Proceedings of the 1999 Congress on volume 2 1999

[CLTG14] N Chechina H Li P Trinder and A Ghaffari Scalable SD Erlang computation modelTechnical Report TR-2014-003 The University of Glasgow December 2014

[dBSD00] Matthijs den Besten Thomas Stutzle and Marco Dorigo Ant colony optimization for thetotal weighted tardiness problem In Marc Schoenauer Kalyanmoy Deb Gunther RudolphXin Yao Evelyne Lutton JuanJulian Merelo and Hans-Paul Schwefel editors ParallelProblem Solving from Nature PPSN VI volume 1917 of Lecture Notes in Computer Sciencepages 611ndash620 Springer Berlin Heidelberg 2000

ICT-287510 (RELEASE) 23rd December 2015 66

[Del13] Pierre Delisle Parallel Ant Colony Optimization Algorithmic Models and Hardware Imple-mentations pages 45ndash62 Intech 2013

[DS04] Marco Dorigo and Thomas Stutzle Ant Colony Optimization Bradford Company ScituateMA USA 2004

[GCTM13] Amir Ghaffari Natalia Chechina Phil Trinder and Jon Meredith Scalable persistentstorage for Erlang Theory and practice In Proceedings of the Twelfth ACM SIGPLANWorkshop on Erlang Erlang rsquo13 pages 73ndash74 New York NY USA 2013 ACM

[GPG15] GPG Cluster 2015 httpwwwdcsglaacukresearchgpgclusterhtm

[Hof10] Todd Hoff Netflix Continually Test by Failing Servers with Chaos Mon-key httphighscalabilitycomblog20101228netflix-continually-test-by-failing-servers-with-chaos-monkehtml December 2010

[IB13] Sorin Ilie and Costin Badica Multi-agent approach to distributed ant colony optimizationScience of Computer Programming 78(6)762ndash774 2013

[KYSO00] H Kawamura M Yamamoto K Suzuki and A Ohuchi Multiple ant colonies algorithmbased on colony level interactions IEICE Transactions on Fundamentals of ElectronicsCommunications and Computer Sciences E83-A(2)371ndash379 2000

[LN01] Frank Lubeck and Max Neunhoffer Enumerating large Orbits and direct condensationExperimental Mathematics 10(2)197ndash205 2001

[Lun12] Daniel Luna Chaos Monkey httpsgithubcomdLunachaos_monkey 2012

[MC98] Jeff Matocha and Tracy Camp A taxonomy of distributed termination detection algorithmsJournal of Systems and Software 43(221)207ndash221 1998

[McN59] Robert McNaughton Scheduling with deadlines and loss functions Management Science6(1)1ndash12 1959

[MM00] Daniel Merkle and Martin Middendorf An ant algorithm with a new pheromone evaluationrule for total tardiness problems In Proceedings of EvoWorkshops 2000 volume 1803 ofLNCS pages 287ndash296 Springer Verlag 2000

[MRS02] Martin Middendorf Frank Reischle and Hartmut Schmeck Multi colony ant algorithmsJournal of Heuristics 8(3)305ndash320 2002

[PNC11] Martin Pedemonte Sergio Nesmachnow and Hector Cancela A survey on parallel ant colonyoptimization Appl Soft Comput 11(8)5181ndash5197 2011

[PVW91] C N Potts and L N Van Wassenhove Single machine tardiness sequencing heuristics IIETransactions 23(4)346ndash354 1991

[REL14a] RELEASE Project Deliverable D34 Scalable Reliable OTP Library Release September2014

[REL14b] RELEASE Project Deliverable D43 Heterogeneous Super-cluster Infrastructure July 2014

[REL15] RELEASE Project Deliverable D35 Performance Portability Principles February 2015

ICT-287510 (RELEASE) 23rd December 2015 67

[RV09] NR Srinivasa Raghavan and M Venkataramana Parallel processor scheduling for minimiz-ing total weighted tardiness using ant colony optimization The International Journal ofAdvanced Manufacturing Technology 41(9-10)986ndash996 2009

[SKN+13] Konstantinos Sagonas David Klaftenegger Patrik Nyblom Nikolaos Papaspyrou KaterinaRoukounaki and Kjell Winblad Deliverable D22 Prototype Scalable Erlang VM ReleaseApril 2013

[Sol14] Erlang Solutions WombatOAM-enabled Sim-Diasca httpsgithubcomrelease-projectsimdiasca 2014

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
  • Benchmarks
    • Orbit
      • Running Orbit on Athos
      • Distributed Erlang Orbit
      • SD Erlang Orbit
      • Experimental Evaluation
      • Results on Other Architectures
    • Ant Colony Optimisation (ACO)
      • ACO and SMTWTP
      • Multi-colony approaches
      • Evaluating Scalability
      • Experimental Evaluation
      • Performance comparison of different ACO and Erlang versions on the Athos cluster
        • Basic results
        • Increasing the number of messages
        • Some problematic results
        • Network Traffic
      • Summary
  • Measurements
    • Distributed Scalability
      • Performance
      • Distributed Performance Analysis
      • Discussion
    • BenchErl
    • Percept2
      • Experiments
  • Deploying Sim-Diasca with WombatOAM
    • The design of the implemented solution
    • Deployment steps
    • SD Erlang Integration
    • Implications and Future Work
  • Porting Erlang/OTP to the Blue Gene/Q
    • Basing Erlang/OTP's Distribution Mechanism on MPI
    • MPI Driver Internals
    • Current Status of the Blue Gene/Q Port
  • Single-machine ACO performance on various architectures and Erlang/OTP releases
    • Experimental parameters
    • Discussion of results
      • EDF Xeon machines
      • Glasgow Xeon machines
      • AMD machines
    • Discussion

ICT-287510 (RELEASE) 23rd December 2015 59

[Figure 46: EDF Xeon machines, small executions. Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

• Machines in EDF's Athos cluster. These each have 24 Intel Xeon E5-2697 v2 processing units and 64GB of RAM.

• Machines in the GPG cluster at Glasgow University. These each have 16 Intel Xeon E5-2640 v2 processing units and 64GB of RAM.

• A machine at Ericsson with an AMD Opteron 4376 HE processor, with 8 cores and 32GB of RAM.

• A multicore machine at Heriot-Watt University, called cantor, which has an AMD Opteron 6248 processor (48 cores) and 512GB of RAM.

The Xeon machines are hyperthreaded, with two processing units per core.

B.1 Experimental parameters

We ran the ACO application with an input of size 40 and with 50 generations of ants. On each system we carried out a sequence of experiments with increasing numbers of ants (one Erlang process per ant), so that we could observe the effect of the number of processes on execution time. There were two sizes of experiment:

• Small: 1, 10, 20, 30, …, 1000 ants

• Large: 1, 500, 1000, 1500, …, 100000 ants

The small experiments have a relatively low number of processes, similar to those in our distributed experiments. The large experiments use large numbers of concurrent processes to make sure that the Erlang VM is fully exercised.

The program was run 5 times with each number of ants, and our graphs show the mean execution time over these 5 runs for each number of ants. We repeated the experiments with several VM releases.
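The per-configuration averaging described above can be sketched as follows; the timing values and the dictionary layout are our own illustration, not the recorded data:

```python
# Mean execution time per ant count over 5 runs; the numbers here are
# illustrative placeholders, not measurements from the experiments.
from statistics import mean

timings = {                      # ant count -> five timings in seconds
    10: [0.41, 0.39, 0.40, 0.42, 0.38],
    20: [0.55, 0.57, 0.53, 0.56, 0.54],
}
means = {ants: round(mean(ts), 3) for ants, ts in timings.items()}
print(means)  # one plotted point per ant count
```

Each mean becomes one point on the execution-time curves in the figures below.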


[Figure 47: EDF Xeon machines, large executions. Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000), with one curve each for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

[Figure 48: EDF Xeon machines, small executions, erts +Muacul0 flag set. Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]


[Figure 49: EDF Xeon machines, large executions, erts +Muacul0 flag set. Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

The releases tested were R15B (released in December 2011, just after the start of the RELEASE project), R16B03, OTP 17.0, OTP 17.4 (the most recent official version at the time of writing, in early 2015), and a version based on OTP 17.4 but including modifications from the RELEASE project.

B.2 Discussion of results

B.2.1 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines. It is clear that the 17.x versions uniformly run more slowly than R15B and R16B03. Closer analysis of the data shows that the ratio of execution times is roughly constant, with OTP 17.4 taking about 12.6% longer than R15B.
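The "roughly constant ratio" observation can be checked mechanically by computing the relative slowdown at each ant count; the timings below are illustrative numbers chosen to exhibit a constant ~12.6% ratio, not measured data:

```python
# Percentage slowdown of one series of mean times relative to another.
# Values are illustrative, not measurements from the experiments.
def slowdown_percent(base, other):
    return [(o / b - 1.0) * 100.0 for b, o in zip(base, other)]

r15b_means   = [0.200, 0.400, 0.600]     # hypothetical R15B means (s)
otp174_means = [0.2252, 0.4504, 0.6756]  # hypothetical OTP 17.4 means (s)

print([round(s, 1) for s in slowdown_percent(r15b_means, otp174_means)])
```

A near-constant list of percentages, as here, is what "the ratio of execution times is roughly constant" means in practice.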

Having seen these results, our colleagues at Ericsson suggested that some new default settings for the Erlang runtime system in the OTP 17 versions might be affecting the VM's performance, and recommended re-running the experiments with the +Muacul0 emulator flag to restore the previous defaults. We tried this, and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versions except R15B). Visually, these graphs are essentially indistinguishable from Figures 46 and 47. Analysis of the numerical data shows that setting +Muacul0 does in fact improve performance slightly, but only by about 0.7% for OTP 17.4. This clearly doesn't explain the gap between R15B and OTP 17.4.

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran the experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51; they are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results shown in Figure 52: it is a


[Figure 50: Glasgow Xeon machines, small executions. Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

[Figure 51: Glasgow Xeon machines, large executions. Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0 and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0 and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading is responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (so that only one processing unit per core was in use), but still saw the same effect.
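The restriction to even-numbered CPUs can be sketched as follows; the deliverable does not say which pinning mechanism was used, so the taskset command shown is an assumption:

```python
# Even-numbered logical CPUs on a 24-processing-unit hyperthreaded node,
# i.e. one hardware thread per physical core. How the experiments actually
# pinned the VM (taskset, cpusets, scheduler bind flags, ...) is not stated;
# the taskset invocation printed below is illustrative only.
even_cpus = ",".join(str(c) for c in range(0, 24, 2))
print(even_cpus)
print(f"taskset -c {even_cpus} erl -noshell ...")  # illustrative, not run
```

If the OTP 17 slowdown were caused by hyperthread contention, it should disappear under this pinning; it did not.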

We have also been unable to explain the poor performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.


[Figure 52: Ericsson AMD machine, small executions.]

[Figure 53: Heriot-Watt AMD machine, small executions. Plot of execution time (s) against number of ants (1, 10, 20, 30, …, 1000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]


[Figure 54: Heriot-Watt AMD machine, large executions. Plot of execution time (s) against number of ants (1, 500, 1000, 1500, …, 100000) for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Change Log

Version Date Comments

0.1 31/01/2015 First version, submitted to internal reviewers

0.2 23/03/2015 Revised version, based on comments from all internal reviewers, submitted to the Commission Services

1.0 27/03/2015 Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Badica. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting ErlangOTP to the Blue GeneQ
                                            • Basing ErlangOTPs Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue GeneQ Port
                                              • Single-machine ACO performance on various architectures and ErlangOTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion
Page 62: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

ICT-287510 (RELEASE) 23rd December 2015 60

020

4060

80

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 47 EDF Xeon machines large executions

00

02

04

06

08

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 48 EDF Xeon machines small executions erts +Muacul0 flag set

ICT-287510 (RELEASE) 23rd December 2015 61

020

4060

80

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 49 EDF Xeon machines large executions erts +Muacul0 flag set

R15B (released in December 2011 just after the start of the RELEASE project) R16B03 OTP 170OTP 174 (the most recent official version at the time of writing in early 2015) and a version based onOTP 174 but including modifications from the RELEASE project

B2 Discussion of results

B21 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines It is clear that the 17 versionsuniformly run more slowly than R15B and R163 Closer analysis of the data shows that the ratio ofexecution times is roughly constant with OTP 174 taking about 126 longer than R15B

Having seen these results our colleagues at Ericsson suggested that some new default settings forthe Erlang runtime system in the OTP-17 versions might be affecting the VMrsquos performance and rec-ommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaultsWe tried this and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versionsexcept R15B) Visually these graphs are essentially indistinguishable from Figures 46 and 47 Analysisof the numerical data shows that setting +Muacul0 does in fact improve performance slightly but onlyby about 07 for OTP 174 This clearly doesnrsquot explain the gap between R15B and OTP174

B22 Glasgow Xeon machines

To confirm these results we re-ran these experiments on Xeon machines at Glasgow The results areshown in Figures 50 and 51 and are very similar in form to the results from the EDF machines exceptthat the discrepancy is now about 15 on average

B23 AMD machines

We reported these results to our colleagues at Ericsson but they were unable to reproduce them Theyran our small experiments on an 8-core AMD machine and obtained the results in Figure 52 it is a

ICT-287510 (RELEASE) 23rd December 2015 62

00

02

04

06

08

10

12

14

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 50 Glasgow Xeon machines small executions

020

4060

8010

012

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 51 Glasgow Xeon machines large executions

ICT-287510 (RELEASE) 23rd December 2015 63

little difficult to see the details but the upper line is the result for R15B03 whereas the others are forR16B03 OTP 170 and OTP 174 the latter three being practically indistinguishable

To confirm this we ran our full set of experiments on a 48-core AMD machine at Heriot-WattUniversity The results are shown in Figures 53 and 54 The results here are somewhat more irregularprobably due to the fact that we did not have exclusive access to the machine and other users wererunning occasional small jobs However our results are similar to Ericssonrsquos R16B03 OTP 170 andOTP 174 have similar performance and all are considerably faster than R15B (R15B takes about 9longer than R16B03 and OTP 170 and about 6 longer than the official OTP 174 release) TheRELEASE version performs badly on the AMD machine taking about 8 longer than R15B and 15longer than the official OTP 174 version This contrasts strongly with the results for the EDF Xeonmachines where the official and RELEASE versions of OTP 174 have very similar performance (infact the RELEASE version is about 05 faster than the official version) and both are about 13slower than R15B

B3 Discussion

It thus seems that recent versions of ErlangOTP perform well on AMD architectures but compara-tively badly on Xeon architectures We have as yet been unable to determine the cause of this Onemight suspect that the Xeon machinesrsquo hyperthreading might be responsible but this seems not to bethe case We ran our experiments with the Erlang VM restricted to run only on the even-numberedCPUs (which would mean that only one CPU per core was being used) but still saw the same effect

We have also been unable to explain the bad performance of the RELEASE version in comparisonwith OTP 174 (on which it is based) on the AMD machine The main changes in the RELEASE versionare in the distribution system (to support s groups) and in the addition of DTrace probes to facilitatemonitoring Since our results were obtained using a single-machine version of the ACO program whichmade no use of the distribution system we suspect that the DTrace probes are responsible

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52 Ericsson AMD machine small executions

00

02

04

06

08

10

12

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 53 Heriot-Watt AMD machine small executions

ICT-287510 (RELEASE) 23rd December 2015 65

020

4060

8010

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 54 Heriot-Watt AMD machine large executions

Change Log

Version Date Comments

01 31012015 First Version Submitted to Internal Reviewers

02 23032015 Revised version based on comments from all internal reviewers submittedto the Commission Services

10 27032015 Final version submitted to the Commission Services

References

[APR+12] Stavros Aronis Nikolaos Papaspyrou Katerina Roukounaki Konstantinos Sagonas Yian-nis Tsiouris and Ioannis E Venetis A scalability benchmark suite for ErlangOTP InProceedings of the Eleventh ACM SIGPLAN Workshop on Erlang pages 33ndash42 ACM 2012

[Bas14] Basho Riak 2014

[BBHS99] A Bauer B Bullnheimer RF Hartl and C Strauss An ant colony optimization approachfor the single machine total tardiness problem In Evolutionary Computation 1999 CEC99 Proceedings of the 1999 Congress on volume 2 1999

[CLTG14] N Chechina H Li P Trinder and A Ghaffari Scalable SD Erlang computation modelTechnical Report TR-2014-003 The University of Glasgow December 2014

[dBSD00] Matthijs den Besten Thomas Stutzle and Marco Dorigo Ant colony optimization for thetotal weighted tardiness problem In Marc Schoenauer Kalyanmoy Deb Gunther RudolphXin Yao Evelyne Lutton JuanJulian Merelo and Hans-Paul Schwefel editors ParallelProblem Solving from Nature PPSN VI volume 1917 of Lecture Notes in Computer Sciencepages 611ndash620 Springer Berlin Heidelberg 2000

ICT-287510 (RELEASE) 23rd December 2015 66

[Del13] Pierre Delisle Parallel Ant Colony Optimization Algorithmic Models and Hardware Imple-mentations pages 45ndash62 Intech 2013

[DS04] Marco Dorigo and Thomas Stutzle Ant Colony Optimization Bradford Company ScituateMA USA 2004

[GCTM13] Amir Ghaffari Natalia Chechina Phil Trinder and Jon Meredith Scalable persistentstorage for Erlang Theory and practice In Proceedings of the Twelfth ACM SIGPLANWorkshop on Erlang Erlang rsquo13 pages 73ndash74 New York NY USA 2013 ACM

[GPG15] GPG Cluster 2015 httpwwwdcsglaacukresearchgpgclusterhtm

[Hof10] Todd Hoff Netflix Continually Test by Failing Servers with Chaos Mon-key httphighscalabilitycomblog20101228netflix-continually-test-by-failing-servers-with-chaos-monkehtml December 2010

[IB13] Sorin Ilie and Costin Badica Multi-agent approach to distributed ant colony optimizationScience of Computer Programming 78(6)762ndash774 2013

[KYSO00] H Kawamura M Yamamoto K Suzuki and A Ohuchi Multiple ant colonies algorithmbased on colony level interactions IEICE Transactions on Fundamentals of ElectronicsCommunications and Computer Sciences E83-A(2)371ndash379 2000

[LN01] Frank Lubeck and Max Neunhoffer Enumerating large Orbits and direct condensationExperimental Mathematics 10(2)197ndash205 2001

[Lun12] Daniel Luna Chaos Monkey httpsgithubcomdLunachaos_monkey 2012

[MC98] Jeff Matocha and Tracy Camp A taxonomy of distributed termination detection algorithmsJournal of Systems and Software 43(221)207ndash221 1998

[McN59] Robert McNaughton Scheduling with deadlines and loss functions Management Science6(1)1ndash12 1959

[MM00] Daniel Merkle and Martin Middendorf An ant algorithm with a new pheromone evaluationrule for total tardiness problems In Proceedings of EvoWorkshops 2000 volume 1803 ofLNCS pages 287ndash296 Springer Verlag 2000

[MRS02] Martin Middendorf Frank Reischle and Hartmut Schmeck Multi colony ant algorithmsJournal of Heuristics 8(3)305ndash320 2002

[PNC11] Martin Pedemonte Sergio Nesmachnow and Hector Cancela A survey on parallel ant colonyoptimization Appl Soft Comput 11(8)5181ndash5197 2011

[PVW91] C N Potts and L N Van Wassenhove Single machine tardiness sequencing heuristics IIETransactions 23(4)346ndash354 1991

[REL14a] RELEASE Project Deliverable D34 Scalable Reliable OTP Library Release September2014

[REL14b] RELEASE Project Deliverable D43 Heterogeneous Super-cluster Infrastructure July 2014

[REL15] RELEASE Project Deliverable D35 Performance Portability Principles February 2015

ICT-287510 (RELEASE) 23rd December 2015 67

[RV09] NR Srinivasa Raghavan and M Venkataramana Parallel processor scheduling for minimiz-ing total weighted tardiness using ant colony optimization The International Journal ofAdvanced Manufacturing Technology 41(9-10)986ndash996 2009

[SKN+13] Konstantinos Sagonas David Klaftenegger Patrik Nyblom Nikolaos Papaspyrou KaterinaRoukounaki and Kjell Winblad Deliverable D22 Prototype Scalable Erlang VM ReleaseApril 2013

[Sol14] Erlang Solutions WombatOAM-enabled Sim-Diasca httpsgithubcomrelease-projectsimdiasca 2014

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting ErlangOTP to the Blue GeneQ
                                            • Basing ErlangOTPs Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue GeneQ Port
                                              • Single-machine ACO performance on various architectures and ErlangOTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion
Page 63: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

ICT-287510 (RELEASE) 23rd December 2015 61

020

4060

80

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 49 EDF Xeon machines large executions erts +Muacul0 flag set

R15B (released in December 2011 just after the start of the RELEASE project) R16B03 OTP 170OTP 174 (the most recent official version at the time of writing in early 2015) and a version based onOTP 174 but including modifications from the RELEASE project

B2 Discussion of results

B21 EDF Xeon machines

Figures 46 and 47 show the results for the EDF Xeon machines It is clear that the 17 versionsuniformly run more slowly than R15B and R163 Closer analysis of the data shows that the ratio ofexecution times is roughly constant with OTP 174 taking about 126 longer than R15B

Having seen these results our colleagues at Ericsson suggested that some new default settings forthe Erlang runtime system in the OTP-17 versions might be affecting the VMrsquos performance and rec-ommended re-running the experiments with the +Muacul0 emulator flag to use the previous defaultsWe tried this and the results are shown in Figures 48 and 49 (the +Muacul0 flag was set for all versionsexcept R15B) Visually these graphs are essentially indistinguishable from Figures 46 and 47 Analysisof the numerical data shows that setting +Muacul0 does in fact improve performance slightly but onlyby about 07 for OTP 174 This clearly doesnrsquot explain the gap between R15B and OTP174

B.2.2 Glasgow Xeon machines

To confirm these results, we re-ran these experiments on Xeon machines at Glasgow. The results are shown in Figures 50 and 51, and are very similar in form to the results from the EDF machines, except that the discrepancy is now about 15% on average.

B.2.3 AMD machines

We reported these results to our colleagues at Ericsson, but they were unable to reproduce them. They ran our small experiments on an 8-core AMD machine and obtained the results in Figure 52: it is a


Figure 50: Glasgow Xeon machines, small executions. (Plot: execution time in seconds against number of ants, up to 1000, for the same five Erlang/OTP versions.)

Figure 51: Glasgow Xeon machines, large executions. (Plot: execution time in seconds against number of ants, up to 100,000, for the same five Erlang/OTP versions.)


little difficult to see the details, but the upper line is the result for R15B03, whereas the others are for R16B03, OTP 17.0, and OTP 17.4, the latter three being practically indistinguishable.

To confirm this, we ran our full set of experiments on a 48-core AMD machine at Heriot-Watt University. The results are shown in Figures 53 and 54. The results here are somewhat more irregular, probably because we did not have exclusive access to the machine and other users were running occasional small jobs. However, our results are similar to Ericsson's: R16B03, OTP 17.0, and OTP 17.4 have similar performance, and all are considerably faster than R15B (R15B takes about 9% longer than R16B03 and OTP 17.0, and about 6% longer than the official OTP 17.4 release). The RELEASE version performs badly on the AMD machine, taking about 8% longer than R15B and 15% longer than the official OTP 17.4 version. This contrasts strongly with the results for the EDF Xeon machines, where the official and RELEASE versions of OTP 17.4 have very similar performance (in fact, the RELEASE version is about 0.5% faster than the official version) and both are about 13% slower than R15B.

B.3 Discussion

It thus seems that recent versions of Erlang/OTP perform well on AMD architectures, but comparatively badly on Xeon architectures. We have as yet been unable to determine the cause of this. One might suspect that the Xeon machines' hyperthreading might be responsible, but this seems not to be the case: we ran our experiments with the Erlang VM restricted to run only on the even-numbered CPUs (which would mean that only one CPU per core was being used), but still saw the same effect.
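One way to run such a check on Linux is to pin the VM to alternate logical CPUs with taskset, assuming (as on our machines) that even-numbered CPU IDs map to distinct physical cores; a sketch, with a hypothetical benchmark entry point:

```shell
# Build the CPU list 0,2,4,...,30 (one logical CPU per physical core on a
# 16-core/32-thread Xeon node) and pin the Erlang VM to it with taskset.
# The command is printed rather than executed here.
CPUS=$(seq -s, 0 2 30)
echo "taskset -c $CPUS erl -noshell -run aco_bench main -run init stop"
```

Restricting the VM this way means each scheduler thread runs on its own physical core, so any remaining slowdown cannot be attributed to hyperthread sharing.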

We have also been unable to explain the bad performance of the RELEASE version in comparison with OTP 17.4 (on which it is based) on the AMD machine. The main changes in the RELEASE version are in the distribution system (to support s_groups) and in the addition of DTrace probes to facilitate monitoring. Since our results were obtained using a single-machine version of the ACO program, which made no use of the distribution system, we suspect that the DTrace probes are responsible.
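Whether a given emulator build actually carries dynamic-tracing support can be confirmed from the shell via erlang:system_info(dynamic_trace), which returns dtrace, systemtap, or none; a sketch (command printed rather than executed, since erl may not be on the PATH here):

```shell
# Sketch: query the emulator for compiled-in dynamic-tracing support.
CHECK="erl -noshell -eval 'io:format(\"~p~n\", [erlang:system_info(dynamic_trace)]), init:stop().'"
echo "$CHECK"
```

Comparing this output for the official and RELEASE builds would confirm that the probes are present only in the latter.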


Figure 52: Ericsson AMD machine, small executions.

Figure 53: Heriot-Watt AMD machine, small executions. (Plot: execution time in seconds against number of ants, up to 1000, for the same five Erlang/OTP versions.)


Figure 54: Heriot-Watt AMD machine, large executions. (Plot: execution time in seconds against number of ants, up to 100,000, for the same five Erlang/OTP versions.)

Change Log

Version  Date        Comments
0.1      31/01/2015  First version, submitted to internal reviewers
0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services
1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm.

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Appl. Soft Comput., 11(8):5181–5197, 2011.

[PVW91] C.N. Potts and L.N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting ErlangOTP to the Blue GeneQ
                                            • Basing ErlangOTPs Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue GeneQ Port
                                              • Single-machine ACO performance on various architectures and ErlangOTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion
Page 64: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

ICT-287510 (RELEASE) 23rd December 2015 62

00

02

04

06

08

10

12

14

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 50 Glasgow Xeon machines small executions

020

4060

8010

012

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 51 Glasgow Xeon machines large executions

ICT-287510 (RELEASE) 23rd December 2015 63

little difficult to see the details but the upper line is the result for R15B03 whereas the others are forR16B03 OTP 170 and OTP 174 the latter three being practically indistinguishable

To confirm this we ran our full set of experiments on a 48-core AMD machine at Heriot-WattUniversity The results are shown in Figures 53 and 54 The results here are somewhat more irregularprobably due to the fact that we did not have exclusive access to the machine and other users wererunning occasional small jobs However our results are similar to Ericssonrsquos R16B03 OTP 170 andOTP 174 have similar performance and all are considerably faster than R15B (R15B takes about 9longer than R16B03 and OTP 170 and about 6 longer than the official OTP 174 release) TheRELEASE version performs badly on the AMD machine taking about 8 longer than R15B and 15longer than the official OTP 174 version This contrasts strongly with the results for the EDF Xeonmachines where the official and RELEASE versions of OTP 174 have very similar performance (infact the RELEASE version is about 05 faster than the official version) and both are about 13slower than R15B

B3 Discussion

It thus seems that recent versions of ErlangOTP perform well on AMD architectures but compara-tively badly on Xeon architectures We have as yet been unable to determine the cause of this Onemight suspect that the Xeon machinesrsquo hyperthreading might be responsible but this seems not to bethe case We ran our experiments with the Erlang VM restricted to run only on the even-numberedCPUs (which would mean that only one CPU per core was being used) but still saw the same effect

We have also been unable to explain the bad performance of the RELEASE version in comparisonwith OTP 174 (on which it is based) on the AMD machine The main changes in the RELEASE versionare in the distribution system (to support s groups) and in the addition of DTrace probes to facilitatemonitoring Since our results were obtained using a single-machine version of the ACO program whichmade no use of the distribution system we suspect that the DTrace probes are responsible

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52 Ericsson AMD machine small executions

00

02

04

06

08

10

12

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 53 Heriot-Watt AMD machine small executions

ICT-287510 (RELEASE) 23rd December 2015 65

020

4060

8010

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 54 Heriot-Watt AMD machine large executions

Change Log

Version Date Comments

01 31012015 First Version Submitted to Internal Reviewers

02 23032015 Revised version based on comments from all internal reviewers submittedto the Commission Services

10 27032015 Final version submitted to the Commission Services

References

[APR+12] Stavros Aronis Nikolaos Papaspyrou Katerina Roukounaki Konstantinos Sagonas Yian-nis Tsiouris and Ioannis E Venetis A scalability benchmark suite for ErlangOTP InProceedings of the Eleventh ACM SIGPLAN Workshop on Erlang pages 33ndash42 ACM 2012

[Bas14] Basho Riak 2014

[BBHS99] A Bauer B Bullnheimer RF Hartl and C Strauss An ant colony optimization approachfor the single machine total tardiness problem In Evolutionary Computation 1999 CEC99 Proceedings of the 1999 Congress on volume 2 1999

[CLTG14] N Chechina H Li P Trinder and A Ghaffari Scalable SD Erlang computation modelTechnical Report TR-2014-003 The University of Glasgow December 2014

[dBSD00] Matthijs den Besten Thomas Stutzle and Marco Dorigo Ant colony optimization for thetotal weighted tardiness problem In Marc Schoenauer Kalyanmoy Deb Gunther RudolphXin Yao Evelyne Lutton JuanJulian Merelo and Hans-Paul Schwefel editors ParallelProblem Solving from Nature PPSN VI volume 1917 of Lecture Notes in Computer Sciencepages 611ndash620 Springer Berlin Heidelberg 2000

ICT-287510 (RELEASE) 23rd December 2015 66

[Del13] Pierre Delisle Parallel Ant Colony Optimization Algorithmic Models and Hardware Imple-mentations pages 45ndash62 Intech 2013

[DS04] Marco Dorigo and Thomas Stutzle Ant Colony Optimization Bradford Company ScituateMA USA 2004

[GCTM13] Amir Ghaffari Natalia Chechina Phil Trinder and Jon Meredith Scalable persistentstorage for Erlang Theory and practice In Proceedings of the Twelfth ACM SIGPLANWorkshop on Erlang Erlang rsquo13 pages 73ndash74 New York NY USA 2013 ACM

[GPG15] GPG Cluster 2015 httpwwwdcsglaacukresearchgpgclusterhtm

[Hof10] Todd Hoff Netflix Continually Test by Failing Servers with Chaos Mon-key httphighscalabilitycomblog20101228netflix-continually-test-by-failing-servers-with-chaos-monkehtml December 2010

[IB13] Sorin Ilie and Costin Badica Multi-agent approach to distributed ant colony optimizationScience of Computer Programming 78(6)762ndash774 2013

[KYSO00] H Kawamura M Yamamoto K Suzuki and A Ohuchi Multiple ant colonies algorithmbased on colony level interactions IEICE Transactions on Fundamentals of ElectronicsCommunications and Computer Sciences E83-A(2)371ndash379 2000

[LN01] Frank Lubeck and Max Neunhoffer Enumerating large Orbits and direct condensationExperimental Mathematics 10(2)197ndash205 2001

[Lun12] Daniel Luna Chaos Monkey httpsgithubcomdLunachaos_monkey 2012

[MC98] Jeff Matocha and Tracy Camp A taxonomy of distributed termination detection algorithmsJournal of Systems and Software 43(221)207ndash221 1998

[McN59] Robert McNaughton Scheduling with deadlines and loss functions Management Science6(1)1ndash12 1959

[MM00] Daniel Merkle and Martin Middendorf An ant algorithm with a new pheromone evaluationrule for total tardiness problems In Proceedings of EvoWorkshops 2000 volume 1803 ofLNCS pages 287ndash296 Springer Verlag 2000

[MRS02] Martin Middendorf Frank Reischle and Hartmut Schmeck Multi colony ant algorithmsJournal of Heuristics 8(3)305ndash320 2002

[PNC11] Martin Pedemonte Sergio Nesmachnow and Hector Cancela A survey on parallel ant colonyoptimization Appl Soft Comput 11(8)5181ndash5197 2011

[PVW91] C N Potts and L N Van Wassenhove Single machine tardiness sequencing heuristics IIETransactions 23(4)346ndash354 1991

[REL14a] RELEASE Project Deliverable D34 Scalable Reliable OTP Library Release September2014

[REL14b] RELEASE Project Deliverable D43 Heterogeneous Super-cluster Infrastructure July 2014

[REL15] RELEASE Project Deliverable D35 Performance Portability Principles February 2015

ICT-287510 (RELEASE) 23rd December 2015 67

[RV09] NR Srinivasa Raghavan and M Venkataramana Parallel processor scheduling for minimiz-ing total weighted tardiness using ant colony optimization The International Journal ofAdvanced Manufacturing Technology 41(9-10)986ndash996 2009

[SKN+13] Konstantinos Sagonas David Klaftenegger Patrik Nyblom Nikolaos Papaspyrou KaterinaRoukounaki and Kjell Winblad Deliverable D22 Prototype Scalable Erlang VM ReleaseApril 2013

[Sol14] Erlang Solutions WombatOAM-enabled Sim-Diasca httpsgithubcomrelease-projectsimdiasca 2014

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting ErlangOTP to the Blue GeneQ
                                            • Basing ErlangOTPs Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue GeneQ Port
                                              • Single-machine ACO performance on various architectures and ErlangOTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion
Page 65: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

ICT-287510 (RELEASE) 23rd December 2015 63

little difficult to see the details but the upper line is the result for R15B03 whereas the others are forR16B03 OTP 170 and OTP 174 the latter three being practically indistinguishable

To confirm this we ran our full set of experiments on a 48-core AMD machine at Heriot-WattUniversity The results are shown in Figures 53 and 54 The results here are somewhat more irregularprobably due to the fact that we did not have exclusive access to the machine and other users wererunning occasional small jobs However our results are similar to Ericssonrsquos R16B03 OTP 170 andOTP 174 have similar performance and all are considerably faster than R15B (R15B takes about 9longer than R16B03 and OTP 170 and about 6 longer than the official OTP 174 release) TheRELEASE version performs badly on the AMD machine taking about 8 longer than R15B and 15longer than the official OTP 174 version This contrasts strongly with the results for the EDF Xeonmachines where the official and RELEASE versions of OTP 174 have very similar performance (infact the RELEASE version is about 05 faster than the official version) and both are about 13slower than R15B

B3 Discussion

It thus seems that recent versions of ErlangOTP perform well on AMD architectures but compara-tively badly on Xeon architectures We have as yet been unable to determine the cause of this Onemight suspect that the Xeon machinesrsquo hyperthreading might be responsible but this seems not to bethe case We ran our experiments with the Erlang VM restricted to run only on the even-numberedCPUs (which would mean that only one CPU per core was being used) but still saw the same effect

We have also been unable to explain the bad performance of the RELEASE version in comparisonwith OTP 174 (on which it is based) on the AMD machine The main changes in the RELEASE versionare in the distribution system (to support s groups) and in the addition of DTrace probes to facilitatemonitoring Since our results were obtained using a single-machine version of the ACO program whichmade no use of the distribution system we suspect that the DTrace probes are responsible

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52 Ericsson AMD machine small executions

00

02

04

06

08

10

12

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 53 Heriot-Watt AMD machine small executions

ICT-287510 (RELEASE) 23rd December 2015 65

020

4060

8010

0

Number of ants (150010001500100000)

Exe

cutio

n tim

e (s

)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 54 Heriot-Watt AMD machine large executions

Change Log

Version Date Comments

01 31012015 First Version Submitted to Internal Reviewers

02 23032015 Revised version based on comments from all internal reviewers submittedto the Commission Services

10 27032015 Final version submitted to the Commission Services

References

[APR+12] Stavros Aronis Nikolaos Papaspyrou Katerina Roukounaki Konstantinos Sagonas Yian-nis Tsiouris and Ioannis E Venetis A scalability benchmark suite for ErlangOTP InProceedings of the Eleventh ACM SIGPLAN Workshop on Erlang pages 33ndash42 ACM 2012

[Bas14] Basho Riak 2014

[BBHS99] A Bauer B Bullnheimer RF Hartl and C Strauss An ant colony optimization approachfor the single machine total tardiness problem In Evolutionary Computation 1999 CEC99 Proceedings of the 1999 Congress on volume 2 1999

[CLTG14] N Chechina H Li P Trinder and A Ghaffari Scalable SD Erlang computation modelTechnical Report TR-2014-003 The University of Glasgow December 2014

[dBSD00] Matthijs den Besten Thomas Stutzle and Marco Dorigo Ant colony optimization for thetotal weighted tardiness problem In Marc Schoenauer Kalyanmoy Deb Gunther RudolphXin Yao Evelyne Lutton JuanJulian Merelo and Hans-Paul Schwefel editors ParallelProblem Solving from Nature PPSN VI volume 1917 of Lecture Notes in Computer Sciencepages 611ndash620 Springer Berlin Heidelberg 2000

ICT-287510 (RELEASE) 23rd December 2015 66

[Del13] Pierre Delisle Parallel Ant Colony Optimization Algorithmic Models and Hardware Imple-mentations pages 45ndash62 Intech 2013

[DS04] Marco Dorigo and Thomas Stutzle Ant Colony Optimization Bradford Company ScituateMA USA 2004

[GCTM13] Amir Ghaffari Natalia Chechina Phil Trinder and Jon Meredith Scalable persistentstorage for Erlang Theory and practice In Proceedings of the Twelfth ACM SIGPLANWorkshop on Erlang Erlang rsquo13 pages 73ndash74 New York NY USA 2013 ACM

[GPG15] GPG Cluster 2015 httpwwwdcsglaacukresearchgpgclusterhtm

[Hof10] Todd Hoff Netflix Continually Test by Failing Servers with Chaos Mon-key httphighscalabilitycomblog20101228netflix-continually-test-by-failing-servers-with-chaos-monkehtml December 2010

[IB13] Sorin Ilie and Costin Badica Multi-agent approach to distributed ant colony optimizationScience of Computer Programming 78(6)762ndash774 2013

[KYSO00] H Kawamura M Yamamoto K Suzuki and A Ohuchi Multiple ant colonies algorithmbased on colony level interactions IEICE Transactions on Fundamentals of ElectronicsCommunications and Computer Sciences E83-A(2)371ndash379 2000

[LN01] Frank Lubeck and Max Neunhoffer Enumerating large Orbits and direct condensationExperimental Mathematics 10(2)197ndash205 2001

[Lun12] Daniel Luna Chaos Monkey httpsgithubcomdLunachaos_monkey 2012

[MC98] Jeff Matocha and Tracy Camp A taxonomy of distributed termination detection algorithmsJournal of Systems and Software 43(221)207ndash221 1998

[McN59] Robert McNaughton Scheduling with deadlines and loss functions Management Science6(1)1ndash12 1959

[MM00] Daniel Merkle and Martin Middendorf An ant algorithm with a new pheromone evaluationrule for total tardiness problems In Proceedings of EvoWorkshops 2000 volume 1803 ofLNCS pages 287ndash296 Springer Verlag 2000

[MRS02] Martin Middendorf Frank Reischle and Hartmut Schmeck Multi colony ant algorithmsJournal of Heuristics 8(3)305ndash320 2002

[PNC11] Martin Pedemonte Sergio Nesmachnow and Hector Cancela A survey on parallel ant colonyoptimization Appl Soft Comput 11(8)5181ndash5197 2011

[PVW91] C N Potts and L N Van Wassenhove Single machine tardiness sequencing heuristics IIETransactions 23(4)346ndash354 1991

[REL14a] RELEASE Project Deliverable D34 Scalable Reliable OTP Library Release September2014

[REL14b] RELEASE Project Deliverable D43 Heterogeneous Super-cluster Infrastructure July 2014

[REL15] RELEASE Project Deliverable D35 Performance Portability Principles February 2015

ICT-287510 (RELEASE) 23rd December 2015 67

[RV09] NR Srinivasa Raghavan and M Venkataramana Parallel processor scheduling for minimiz-ing total weighted tardiness using ant colony optimization The International Journal ofAdvanced Manufacturing Technology 41(9-10)986ndash996 2009

[SKN+13] Konstantinos Sagonas David Klaftenegger Patrik Nyblom Nikolaos Papaspyrou KaterinaRoukounaki and Kjell Winblad Deliverable D22 Prototype Scalable Erlang VM ReleaseApril 2013

[Sol14] Erlang Solutions WombatOAM-enabled Sim-Diasca httpsgithubcomrelease-projectsimdiasca 2014

  • Executive Summary
  • The main case study
    • Sim-Diasca Overview
    • City Example
      • Overview of the simulation case
      • Description of the simulated elements
      • Additional changes done for benchmarking
          • Benchmarks
            • Orbit
              • Running Orbit on Athos
              • Distributed Erlang Orbit
              • SD Erlang Orbit
              • Experimental Evaluation
              • Results on Other Architectures
                • Ant Colony Optimisation (ACO)
                  • ACO and SMTWTP
                  • Multi-colony approaches
                  • Evaluating Scalability
                  • Experimental Evaluation
                    • Performance comparison of different ACO and Erlang versions on the Athos cluster
                      • Basic results
                      • Increasing the number of messages
                      • Some problematic results
                      • Network Traffic
                        • Summary
                          • Measurements
                            • Distributed Scalability
                              • Performance
                              • Distributed Performance Analysis
                              • Discussion
                                • BenchErl
                                • Percept2
                                  • Experiments
                                    • Deploying Sim-Diasca with WombatOAM
                                      • The design of the implemented solution
                                      • Deployment steps
                                        • SD Erlang Integration
                                          • Implications and Future Work
                                          • Porting ErlangOTP to the Blue GeneQ
                                            • Basing ErlangOTPs Distribution Mechanism on MPI
                                            • MPI Driver Internals
                                            • Current Status of the Blue GeneQ Port
                                              • Single-machine ACO performance on various architectures and ErlangOTP releases
                                                • Experimental parameters
                                                • Discussion of results
                                                  • EDF Xeon machines
                                                  • Glasgow Xeon machines
                                                  • AMD machines
                                                    • Discussion
Page 66: D6.2 (WP6): Scalability Case Studies: Scalable Sim …release-project.softlab.ntua.gr/documents/D6.2.pdfICT-287510 RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software

ICT-287510 (RELEASE) 23rd December 2015 64

Figure 52 Ericsson AMD machine small executions

00

02

04

06

08

10

12

Number of ants (11020301000)

Exe

cutio

n tim

e (s

)

0 100 200 300 400 500 600 700 800 900 1000

R15BR16B03minus1OTP170OTP174 (Official)OTP174 (RELEASE)

Figure 53 Heriot-Watt AMD machine small executions

[Plot: execution time (s), y-axis 0–100, against number of ants (1, 500, 1000, 1500, …, 100000); curves for R15B, R16B03-1, OTP 17.0, OTP 17.4 (Official), and OTP 17.4 (RELEASE).]

Figure 54: Heriot-Watt AMD machine, large executions

Change Log

Version  Date        Comments

0.1      31/01/2015  First version, submitted to internal reviewers

0.2      23/03/2015  Revised version based on comments from all internal reviewers, submitted to the Commission Services

1.0      27/03/2015  Final version, submitted to the Commission Services

References

[APR+12] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42. ACM, 2012.

[Bas14] Basho. Riak, 2014.

[BBHS99] A. Bauer, B. Bullnheimer, R.F. Hartl, and C. Strauss. An ant colony optimization approach for the single machine total tardiness problem. In Evolutionary Computation, 1999 (CEC 99), Proceedings of the 1999 Congress on, volume 2, 1999.

[CLTG14] N. Chechina, H. Li, P. Trinder, and A. Ghaffari. Scalable SD Erlang computation model. Technical Report TR-2014-003, The University of Glasgow, December 2014.

[dBSD00] Matthijs den Besten, Thomas Stützle, and Marco Dorigo. Ant colony optimization for the total weighted tardiness problem. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julián Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pages 611–620. Springer Berlin Heidelberg, 2000.

[Del13] Pierre Delisle. Parallel Ant Colony Optimization: Algorithmic Models and Hardware Implementations, pages 45–62. Intech, 2013.

[DS04] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[GCTM13] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, Erlang '13, pages 73–74, New York, NY, USA, 2013. ACM.

[GPG15] GPG Cluster, 2015. http://www.dcs.gla.ac.uk/research/gpgcluster.htm

[Hof10] Todd Hoff. Netflix: Continually test by failing servers with Chaos Monkey. http://highscalability.com/blog/2010/12/28/netflix-continually-test-by-failing-servers-with-chaos-monke.html, December 2010.

[IB13] Sorin Ilie and Costin Bădică. Multi-agent approach to distributed ant colony optimization. Science of Computer Programming, 78(6):762–774, 2013.

[KYSO00] H. Kawamura, M. Yamamoto, K. Suzuki, and A. Ohuchi. Multiple ant colonies algorithm based on colony level interactions. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A(2):371–379, 2000.

[LN01] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[Lun12] Daniel Luna. Chaos Monkey. https://github.com/dLuna/chaos_monkey, 2012.

[MC98] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[MM00] Daniel Merkle and Martin Middendorf. An ant algorithm with a new pheromone evaluation rule for total tardiness problems. In Proceedings of EvoWorkshops 2000, volume 1803 of LNCS, pages 287–296. Springer Verlag, 2000.

[MRS02] Martin Middendorf, Frank Reischle, and Hartmut Schmeck. Multi colony ant algorithms. Journal of Heuristics, 8(3):305–320, 2002.

[PNC11] Martín Pedemonte, Sergio Nesmachnow, and Héctor Cancela. A survey on parallel ant colony optimization. Applied Soft Computing, 11(8):5181–5197, 2011.

[PVW91] C. N. Potts and L. N. Van Wassenhove. Single machine tardiness sequencing heuristics. IIE Transactions, 23(4):346–354, 1991.

[REL14a] RELEASE Project. Deliverable D3.4: Scalable Reliable OTP Library Release, September 2014.

[REL14b] RELEASE Project. Deliverable D4.3: Heterogeneous Super-cluster Infrastructure, July 2014.

[REL15] RELEASE Project. Deliverable D3.5: Performance Portability Principles, February 2015.

[RV09] N.R. Srinivasa Raghavan and M. Venkataramana. Parallel processor scheduling for minimizing total weighted tardiness using ant colony optimization. The International Journal of Advanced Manufacturing Technology, 41(9-10):986–996, 2009.

[SKN+13] Konstantinos Sagonas, David Klaftenegger, Patrik Nyblom, Nikolaos Papaspyrou, Katerina Roukounaki, and Kjell Winblad. Deliverable D2.2: Prototype Scalable Erlang VM Release, April 2013.

[Sol14] Erlang Solutions. WombatOAM-enabled Sim-Diasca. https://github.com/release-project/simdiasca, 2014.
