


DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Performance Optimisation of Discrete-Event Simulation Software on Multi-Core Computers

ALAIN E. KAESLIN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Performance Optimisation of Discrete-Event Simulation Software
on Multi-Core Computers

Prestandaoptimering av händelsestyrd simuleringsmjukvara
på flerkärniga datorer

Alain E. Kaeslin
[email protected]

Degree Project in Computer Science and Communication
Second Cycle, 30 Credits

Master’s Programme in Computer Science

Supervisor: Prof. Dr. Stefano Markidis
Examiner: Prof. Dr. Erwin Laure
Principal: Systecon AB
Contact Person at Principal: John Josefsson

Stockholm, June 26, 2016


Abstract

SIMLOX is a discrete-event simulation software developed by Systecon AB for analysing logistic support solution scenarios. To cope with ever larger problems, SIMLOX’s simulation engine was recently enhanced with a parallel execution mechanism in order to take advantage of multi-core processors. However, this extension did not result in the desired reduction in runtime for all simulation scenarios, even though the parallelisation strategy applied had promised linear speedup. Therefore, an in-depth analysis of the limiting scalability bottlenecks became necessary and has been carried out in this project. Through the use of a low-overhead profiler and microarchitecture analysis, the root causes were identified: atomic operations causing a high communication overhead, poor locality leading to translation lookaside buffer thrashing, and hot spots that consume significant amounts of CPU time. Subsequently, appropriate optimisations to overcome the limiting factors were implemented: eliminating the expensive operations, more efficient handling of heap memory through the use of a scalable memory allocator, and data structures that make better use of caches. Experimental evaluation using real-world test cases demonstrated a speedup of at least 6.75x on an eight-core processor. Most cases even achieve a speedup of more than 7.2x. The various optimisations implemented further helped to lower runtimes for sequential execution by 1.5x or more. It can be concluded that achieving nearly linear speedup on a multi-core processor is possible in practice for discrete-event simulation.


Sammanfattning

SIMLOX is a commercial software product developed by Systecon AB whose main component is a discrete-event simulation engine for analysing maintenance solutions for complex technical systems. To handle large problems, the simulation uses parallel execution, which in theory should give nearly linear scaling with the number of threads. The performance improvement observed in practice was, however, extremely limited, which is why a thorough analysis of the scalability has been carried out in this project. Through the use of a low-overhead profiling tool and microarchitecture analysis, the causes could be identified: atomic operations that create a lot of communication overhead, poor locality causing fragmentation in the translation to physical addresses and poor utilisation of the TLB cache, and certain hot spots that require a lot of CPU time. Thereafter, optimisations to avoid the identified problems were implemented and tested. The tested solutions include elimination of expensive operations, more efficient memory handling through scalable memory-allocation algorithms, and implementation of data structures that give better locality and thereby better use of the cache hierarchy. Verification on real test cases showed speedups of at least 6.75 times on a processor with 8 cores. Most cases showed a speedup by a factor greater than 7.2. The optimisations also gave a speedup by a factor of at least 1.5 for sequential execution in a single thread. The conclusion is therefore that it is possible to achieve nearly linear scaling with the number of cores for this type of discrete-event simulation.


Acknowledgements

I would like to express my gratitude to my supervisor Prof. Dr. Stefano Markidis and my examiner Prof. Dr. Erwin Laure from KTH Royal Institute of Technology for their support throughout this thesis. Additionally, I would like to thank Stefano and Erwin for organising the “PDC Summer School 2015” course, in which I had the great pleasure to take part. Without the knowledge acquired there, this thesis would not have been possible.

My gratitude also goes to Tomas Eriksson, John Josefsson and the whole team at Systecon AB for making this project possible, for their continuous assistance, and for letting me work with their software.


Acronyms

API Application Programming Interface

AVX Advanced Vector Extensions

COW Copy-on-Write

CPI Cycles per Instruction Retired

CUDA Compute Unified Device Architecture

DLL Dynamic Link Library

DTLB Data Translation Lookaside Buffer

DVFS Dynamic Voltage and Frequency Scaling

FPGA Field Programmable Gate Array

FPU Floating Point Unit

GPU Graphics Processing Unit

HLA High-Level Architecture for Modelling and Simulation

HT Hyper-Threading

ICL Boost Interval Container Library

ILP Instruction-Level Parallelism

ITLB Instruction Translation Lookaside Buffer

IU Integer Unit

L1D Level 1 Data Cache

L1I Level 1 Instruction Cache

L1 Level 1

L2 Level 2

L3 Level 3

MESIF Modified, Exclusive, Shared, Invalid and Forward


MESI Modified, Exclusive, Shared and Invalid

MFC Microsoft Foundation Class Library

MIC Many Integrated Core

MPI Message Passing Interface

MRIP Multiple Replications in Parallel

OpenMP Open Multi-Processing

PGO Profile-Guided Optimisation

POSIX Portable Operating System Interface

PRNG Pseudo-Random Number Generator

RAII Resource Acquisition Is Initialisation

RTI Runtime Infrastructure

SIMD Single Instruction, Multiple Data

SMT Simultaneous Multithreading

SRIP.F Single Replication in Parallel through Functional Decomposition

SRIP.M Single Replication in Parallel through Model Decomposition

SRIP Single Replication in Parallel

STLB Second Level Translation Lookaside Buffer

STL C++ Standard Template Library

TBB Intel Threading Building Blocks

TLB Translation Lookaside Buffer

WMI Windows Management Instrumentation


Contents

1 Introduction
   1.1 Background
   1.2 Motivation
   1.3 Objectives and Goals
   1.4 Contributions
   1.5 Ethical and Sustainability Considerations
   1.6 Outline

2 Discrete-Event Simulation
   2.1 Simulation Taxonomy
   2.2 Discrete-Event Simulation
   2.3 Parallel Discrete-Event Simulation
      2.3.1 Multiple Replications in Parallel
      2.3.2 Single Replication in Parallel

3 Parallel Computing
   3.1 Parallel Computer Architecture
      3.1.1 Instruction-Level Parallelism
      3.1.2 Simultaneous Multithreading
      3.1.3 Multi-Core
      3.1.4 Distributed Cache
      3.1.5 Virtual Memory and Translation Lookaside Buffer
      3.1.6 Vector Processing
      3.1.7 Accelerators, Coprocessors and Many-Core
      3.1.8 Dynamic Voltage and Frequency Scaling
   3.2 Parallel Programming
      3.2.1 Task-Oriented Programming Models
      3.2.2 Locality of Reference

4 Related Work

5 Optimising Scalability
   5.1 SIMLOX Implementation
      5.1.1 Result Collection
      5.1.2 Parallelism in SIMLOX
   5.2 Iterative Optimisation Process
      5.2.1 Test Cases
      5.2.2 Test Hardware
   5.3 Profiling and Microarchitecture Analysis
   5.4 Performance Pitfalls in Parallel Programming
   5.5 Symptoms of Scalability Issue in SIMLOX
   5.6 Optimisations
      5.6.1 Choosing a Compiler and Compiler Options
      5.6.2 Using a Scalable Memory Allocator
      5.6.3 Eliminating Unintended Write Sharing
      5.6.4 Using Profile-Guided Optimisation
      5.6.5 Eliminating Translation Lookaside Buffer Thrashing
      5.6.6 Thread Affinity
      5.6.7 Shorten Sequential Startup Phase
      5.6.8 Refactor Preventive Maintenance Algorithm
   5.7 Further Optimisations Evaluated
      5.7.1 Auto Vectorisation
      5.7.2 Disable Log Content Generation

6 Results
   6.1 Scalability
   6.2 Memory Footprint
   6.3 Impact of Turbo Boost
   6.4 Impact of Simultaneous Multithreading

7 Discussion and Conclusions
   7.1 Summary of Method and Results
   7.2 Discussion of Key Findings
   7.3 Limitations and Further Work

References


Chapter 1

Introduction

Discrete-event simulation is commonly used to analyse the behaviour over time of a complex system if the mathematical analysis becomes intractable. It is a powerful tool for analysing the impacts of resource constraints in a variety of environments. Executing a simulation is often associated with a significant computational workload, and therefore using multi-core processors efficiently is important.

1.1 Background

Real-world systems that include maintenance activities, supply chains or transportation are difficult to analyse using analytical methods. Discrete-event simulation has proven to be an efficient method for analysing this type of operation. SIMLOX is a commercial discrete-event simulation software by Systecon AB. It is mainly used for planning and optimising the logistic support solution for a technical system and dimensioning its resources. The unique features and complexity of SIMLOX lie in its comprehensive simulation model, which usually consists of at least the following [EW15]:

• A technical system (e.g. a train and its subcomponents, including their failure rates)

• A support organisation (e.g. maintenance personnel, workshops, tools, repair times, spare part stock levels, etc.)

• An operations model (e.g. a timetable for train operations/missions, including for example rush hours and weekend traffic)

End users can determine if a given operations and logistics scenario results in the desired system availability and identify potential bottlenecks. It is important to consider all the factors mentioned above in a simulation in order to be able to draw accurate conclusions, since even a “very reliable system can still experience a low availability with a poor support system and vice versa” [EW15]. The simulation model is highly configurable and hence a broad range of scenarios can be modelled. SIMLOX has been successfully applied in sectors such as rail transport [Bor06], wind energy [Joh13][HM11] and others, both at Systecon and by external customers.


1.2 Motivation

In recent times Systecon has observed an increasing number of large problem instances that lead to long simulation runtimes. The reasons for this increase are manifold. For example, customers do not only simulate a few months of steady-state operation but instead perform lifecycle planning over many years. This could include modelling the phasing-out of an old-generation system and the phasing-in of its successor. It is not uncommon that such an analysis requires a simulation period of five years or more. Other customers require an increased level of detail in their simulation or use SIMLOX to conduct a sensitivity analysis by studying the impact of input parameter variations.

To deal with this additional computational burden, Systecon has recently added support for multithreading in SIMLOX. Unfortunately, this did not result in the anticipated performance benefit for all simulation cases.

1.3 Objectives and Goals

The goals of this master’s thesis are:

• Gain an understanding of why the parallel performance of the multithreaded SIMLOX version fell short of expectations.

• Identify the scalability bottlenecks in SIMLOX and suggest approaches to alleviate them.

• Verify the suggested improvements by implementing them experimentally. It is outside the scope of this project to provide a production-quality implementation.

In summary, this thesis aims to answer the following question: which factors contribute most to the worse-than-expected speedup, and does their elimination lead to a significant improvement?

This thesis considers only the simulation engine of SIMLOX; other components such as the graphical user interface are outside its scope. The focus is on improving the scalability of large problem instances rather than reducing runtime in general.

1.4 Contributions

The highlights of this master’s thesis are:

• Few publications deal with the technical aspects of efficiently implementing the Multiple Replications in Parallel (MRIP) parallelisation strategy for discrete-event simulation on modern multi-core architectures. This work provides an overview of common issues by using SIMLOX as an example.

• An in-depth analysis of the scalability bottlenecks in SIMLOX is given. The report further discusses how they have been successfully eliminated such that near-linear speedup can be achieved.

• Different facets of efficient cache and memory usage are discussed, namely the impact of the memory allocator, synchronisation overhead originating from space-saving optimisations, and the consequences of shared caches on virtual address translation.


• The effects on scalability of the compiler, the task scheduler and Simultaneous Multithreading (SMT) as well as Dynamic Voltage and Frequency Scaling (DVFS) are analysed.

1.5 Ethical and Sustainability Considerations

This thesis deals solely with reducing the runtime of the SIMLOX software. Therefore, no ethical issues arise directly from the results presented here.

However, end-users of simulation software might face difficult-to-answer ethical questions when making decisions based on simulation results. They have to carefully weigh the validity and applicability of the results, since the use, misuse or non-consideration of simulation results can have severe consequences.

As the references in section 1.1 highlight, the SIMLOX software can clearly help to increase the sustainability of complex systems for energy production, transportation, etc.

1.6 Outline

Chapter 2 introduces the fundamentals of discrete-event simulation and gives an overview of parallelisation strategies for this type of problem in section 2.3. The advantages and disadvantages of the respective approaches are discussed.

Chapter 3 begins with discussing the most important aspects of parallel computer architecture in section 3.1. Special attention is given to shared resources such as caches and other factors affecting the parallel performance of programs. In section 3.2 task-oriented programming and locality are discussed. Both are important concepts for efficient parallel programming.

Chapter 4 contains an overview of related work. After quickly introducing Intel Threading Building Blocks (TBB), the focus is shifted to other publications dealing with parallel discrete-event simulation software.

Chapter 5 starts with providing background information on the implementation of SIMLOX in section 5.1. In section 5.1.2 the multithreaded simulation engine of SIMLOX is presented. Sections 5.2 and 5.3 give an overview of the optimisation process, test cases, test hardware, and tools used in this project. Section 5.4 is concerned with common performance problems in parallel programming and whether they are likely to affect SIMLOX. Finally, the optimisations implemented as part of this project are presented in sections 5.6 and 5.7.

Chapter 6 shows the impact of the optimisations individually for each test case. Section 6.1 discusses the improved scalability, followed by section 6.2, which details the impact on memory consumption. The effects of Dynamic Voltage and Frequency Scaling (DVFS) and Simultaneous Multithreading (SMT) are discussed in sections 6.3 and 6.4.

Finally, chapter 7 summarises the results from chapter 6 and the lessons learned from the project.


Chapter 2

Discrete-Event Simulation

After giving an overview of how discrete-event simulation relates to other simulation techniques in section 2.1, the fundamentals of this method are introduced in section 2.2. The chapter then compares the two most important parallelisation strategies for discrete-event simulation, Multiple Replications in Parallel (MRIP) and Single Replication in Parallel (SRIP), in sections 2.3.1 and 2.3.2.

2.1 Simulation Taxonomy

Simulation techniques can be classified using the three properties shown in figure 2.1 [SYB04]. In a static simulation, time is not relevant. This contrasts with dynamic simulation, where the behaviour over time is analysed. In a continuous simulation, state variables are allowed to change continuously. As opposed to this, changes only occur at specific moments of time in discrete simulation, that is, state variables are not allowed to evolve during the intervals in between. The length of the intervals can either be fixed (time-driven simulation) or irregular (event-driven). Finally, the behaviour property describes whether randomness influences the result. Hence, deterministic simulations are repeatable, whereas probabilistic simulations are not. This thesis is concerned with dynamic, probabilistic, event-driven simulation.

2.2 Discrete-Event Simulation

A simulator based on the discrete-event paradigm usually consists of the following components [AM10]:

State S describing the model.

Clock C storing the current simulation time. After an event has been processed, the clock advances to the next event’s occurrence time.

Events E are instantaneous and assigned at least an:

Occurrence time E.t representing when the event is processed in simulation time.


Figure 2.1: Simulation Taxonomy according to [SYB04]. (Tree diagram: Simulation is classified by Presence of Time into Static and Dynamic; by Basis of Value into Continuous and Discrete, the latter subdivided into Time-Driven and Event-Driven, a.k.a. Discrete-Event; and by Behaviour into Deterministic and Probabilistic.)

Event procedure E.p is executed when an event is processed. The procedure can manipulate both the state and the event queue (by inserting or removing events). The outcome of the procedure is dependent on the state of the model. In case of a probabilistic simulation the outcome is dependent on the random numbers drawn.

Event Queue Q containing all events to be processed, sorted by their time of occurrence. Hence, the next event to be executed is always at the head position. This event is referred to as the “most imminent event”. The event queue is sometimes referred to as “event list” or “sequencing set” (SQS).

Given those components, a simple discrete-event simulator, such as the one shown in algorithm 1, can be implemented. An execution is conceptually shown in figure 2.2.

Algorithm 1 Simple Discrete-Event Simulation Algorithm

    S ← InitialiseState()
    C ← InitialiseClock()
    Q ← InitialiseEventQueue()
    while Q.IsNotEmpty() do
        E ← Q.Pop()    ▷ E is set to the most imminent event
        C ← E.t        ▷ Advance clock C to E’s occurrence time E.t
        E.p()          ▷ E’s event procedure E.p modifies state S and event queue Q
    end while
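For concreteness, a minimal C++ sketch of algorithm 1 follows. The Event type and the scheduling of follow-up events are hypothetical illustrations, not SIMLOX code; a std::priority_queue ordered by occurrence time plays the role of the event queue Q.

    #include <functional>
    #include <queue>
    #include <vector>

    struct Event {
        double t;                  // occurrence time E.t
        std::function<void()> p;   // event procedure E.p
    };

    // Order the queue as a min-heap so the most imminent event is on top.
    struct LaterThan {
        bool operator()(const Event& a, const Event& b) const { return a.t > b.t; }
    };

    int main() {
        double clock = 0.0;                                               // clock C
        std::priority_queue<Event, std::vector<Event>, LaterThan> queue;  // event queue Q

        // An initial event whose procedure schedules a follow-up event.
        queue.push({1.0, [&queue] {
            queue.push({2.5, [] { /* modify state S */ }});
        }});

        while (!queue.empty()) {
            Event e = queue.top();  // most imminent event
            queue.pop();
            clock = e.t;            // advance the clock to E.t
            e.p();                  // may modify state and insert events
        }
    }

Note that std::priority_queue does not support removing arbitrary elements (as the deletion of E3 in figure 2.2 would require); real simulators typically use a more flexible ordered structure for the event queue.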

Probabilistic Discrete-Event Simulation

In a probabilistic simulation a single execution of the simulation does usually not yield meaningful results, since the outcome of the simulation is dependent on the random numbers drawn. To overcome this issue the simulation is replicated several times and the outputs of many replications are aggregated through statistical analysis (typically averages and confidence intervals are calculated). As a source of randomness, Pseudo-Random Number Generators (PRNGs) are often used, rather than true random numbers. In such implementations care needs to be taken in order to ensure that every replication initialises the PRNG with a unique seed.

Figure 2.2: Simplified example of a discrete-event simulation. The clock symbol indicates the current simulation time. (a) Event E0 is currently processed; its event procedure inserts a new event E1 into the event queue. (b) Event E1 is currently processed; its event procedure inserts two new events E2 and E3 into the event queue. (c) Event E2 is currently processed; its event procedure inserts a new event E4 and deletes event E3. (d) The final event E4 is processed.
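The following sketch illustrates this replication scheme. It is illustrative only: run_replication, the base seed and the simple derivation base_seed + r are assumptions, and production codes would typically derive seeds through a proper seed sequence. Each replication gets a uniquely seeded PRNG, and the outputs are aggregated into a mean and a 95% confidence interval.

    #include <cmath>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    // Hypothetical single replication: a uniquely seeded PRNG drives the
    // simulation and one output measure (e.g. availability) is returned.
    double run_replication(std::uint64_t seed) {
        std::mt19937_64 prng(seed);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double acc = 0.0;
        for (int i = 0; i < 1000; ++i)
            acc += u(prng);
        return acc / 1000.0;
    }

    int main() {
        const int replications = 30;
        const std::uint64_t base_seed = 42;   // arbitrary base seed

        std::vector<double> out;
        for (int r = 0; r < replications; ++r)
            out.push_back(run_replication(base_seed + r));  // unique seed per run

        // Aggregate the outputs: mean and a 95% confidence interval
        // (normal approximation, hence the 1.96 quantile).
        const double mean =
            std::accumulate(out.begin(), out.end(), 0.0) / replications;
        double var = 0.0;
        for (double x : out)
            var += (x - mean) * (x - mean);
        var /= replications - 1;
        const double half_width = 1.96 * std::sqrt(var / replications);
        // Report: mean ± half_width
    }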

2.3 Parallel Discrete-Event Simulation

Multiple Replications in Parallel (MRIP) and Single Replication in Parallel (SRIP) are the two main parallelisation strategies for discrete-event simulation distinguished in the literature.

2.3.1 Multiple Replications in Parallel

In the Multiple Replications in Parallel (MRIP) approach, multiple workers run replications (see section 2.2) of the same simulation in parallel, using different seeds. Such a workload is usually embarrassingly parallel, i.e. it can easily be divided into independent subproblems [WA05]. A simulator implementing MRIP usually needs to deal with the following problems: distribution of seeds to the Pseudo-Random Number Generators (PRNGs) and load balancing. MRIP has the following limitations:

• system efficiency is poor if the number of processors does not divide the number of replications

• it is assumed that the simulation model fits into main memory

• system efficiency can be poor if there is a significant difference in runtime between replications (load imbalance)


Unless those conditions are fulfilled, a Single Replication in Parallel (SRIP) approach might be more suitable.
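Because MRIP is embarrassingly parallel, the replication loop from the previous sketch can be distributed across cores with little code. A minimal sketch using std::async follows; the trivial run_replication stand-in is again an assumption, not SIMLOX’s implementation.

    #include <cstdint>
    #include <future>
    #include <random>
    #include <vector>

    // Trivial stand-in for one replication (see the previous sketch).
    double run_replication(std::uint64_t seed) {
        std::mt19937_64 prng(seed);
        return std::uniform_real_distribution<double>(0.0, 1.0)(prng);
    }

    int main() {
        const int replications = 30;
        const std::uint64_t base_seed = 42;

        // One asynchronous task per replication; the runtime maps the tasks
        // onto operating-system threads across the available cores.
        std::vector<std::future<double>> tasks;
        for (int r = 0; r < replications; ++r)
            tasks.push_back(std::async(std::launch::async,
                                       run_replication, base_seed + r));

        std::vector<double> out;
        for (auto& t : tasks)
            out.push_back(t.get());  // blocks until each replication finishes
        // ... aggregate as before ...
    }

Launching one task per replication is the simplest possible scheduling; a production engine would rather use a thread pool or task scheduler (see section 3.2.1) to avoid oversubscribing the processor.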

The usage of graphics cards (see section 3.1.7) as accelerators is unattractive, since the different seeds result in quickly diverging branches. Intel Many Integrated Core (MIC) (see section 3.1.7) is possibly better suited for MRIP parallelisation, as the performance penalty resulting from branch divergence is significantly lower than on GPUs. However, to fully exploit MIC’s performance the program “must also be able to make use of the vector units efficiently” [Rah13].

2.3.2 Single Replication in Parallel

There are two main strategies for parallelising a single replication: either through functional decomposition (SRIP.F) or by decomposing the model (SRIP.M).

Single Replication in Parallel through Functional Decomposition

In Single Replication in Parallel through Functional Decomposition (SRIP.F), individual functions of the simulator (such as random-number generators, event list handling, statistical data analysis, etc.) are executed in parallel. It is a poor parallelisation strategy because the speedup can at most be equal to the number of functions that are executed in parallel. Parallelism that grows with the problem size should always be preferred.

Single Replication in Parallel through Model Decomposition

In Single Replication in Parallel through Model Decomposition (SRIP.M), parallelism is achieved by partitioning the simulation model. Events are handled in parallel, and new events as well as changes to the state are communicated between the concurrent workers. This poses the question of when it is safe to process events concurrently in discrete-event simulation based on algorithm 1. [Gho14] defines the following criterion that must be met:

Criterion 1 If E1 and E2 are two events in a simulation, and E1 ≺ E2, then the event E1 must be simulated before the event E2.

Here ≺ is the “causally ordered before” operator as defined by [Lam78]. It follows that “two events that are not causally ordered in the physical system can be simulated in any order” [Gho14]. Two classes of algorithms for SRIP.M have emerged [Nut11][JAGP12]:

Conservative approach: Events are only allowed to be processed when causality errors are guaranteed not to happen. For this purpose, parallel workers need to commit to a lookahead period for which they guarantee not to execute any event procedures that could violate causal consistency. “Conservative simulation requires frequent communication, even when no dependencies are present” [JAGP12].

Optimistic approach: Events are speculatively processed, and in case of a causality error the simulation is rolled back. This includes sending cancellation messages to other workers, which could lead to a chain of reactions. According to [JAGP12] this approach is “sensitive to communication latency, and incurs the overheads associated with checkpointing and rollbacks.”


Chapter 3

Parallel Computing

In order to fully exploit the performance of a multi-core processor, efficient usage of its resources is important. Section 3.1 discusses some key aspects of multi-core processors which developers should keep in mind when writing parallel software. Two resulting best practices are discussed in section 3.2.

3.1 Parallel Computer Architecture

A bottom-up introduction to three forms of parallelism in multi-core processors is given in sections 3.1.1 to 3.1.3. Then the implementation of distributed caches and address translation in current-generation Intel processors is discussed in sections 3.1.4 and 3.1.5. Finally, special topics such as vector processing, accelerators and Dynamic Voltage and Frequency Scaling (DVFS) are discussed in sections 3.1.6 to 3.1.8.

3.1.1 Instruction-Level Parallelism

Executing a single instruction on a processor conceptually consists of at least the following steps:

1. Fetch (copy) next instruction from memory.

2. Decode (interpret) instruction.

3. Execute instruction.

4. Write result and increment program counter.

To execute a program, those steps are repeated until it terminates. In an oversimplified model, it can be assumed that every step takes one cycle to complete in the best-case scenario. The simplest possible way of execution is shown in figure 3.1. In this strictly sequential form of execution, the processor is clearly underutilised, since at any time only one stage is active.

Fortunately, it is often possible to start a new instruction every cycle, instead of waiting until the previous instruction has been completely processed. This optimisation is called pipelining (see figure 3.2) and allows multiple instructions in different stages to be executed simultaneously. All instructions are fetched from the same thread of execution. Pipelining is therefore largely transparent from a programmer’s point of view.

An additional optimisation is superscalar execution (see figure 3.3). It allows multiple instructions to be issued in every cycle and hence multiple instructions can be at the same stage at the same time. A superscalar processor has multiple execution units of the same or different type (such as Integer Units (IUs), Floating Point Units (FPUs), etc.) and is able to fetch and decode multiple instructions in pairs. As in pipelining, all instructions are fetched from the same thread of execution.

Both figure 3.2 and 3.3 show ideal scenarios in which the processor is efficiently used. In reality the performance can degrade due to pipeline stalls and pipeline flushes (see figure 3.4). Stalls occur if an operation takes longer to execute in any of the pipeline stages. This reduces the throughput since some stages become idle (a bubble is said to appear in the pipeline). Stalls can occur due to data and control dependencies (hazards). A similar issue occurs when not all issue slots in a superscalar processor can be filled in the first place due to hazards. During a pipeline flush all operations currently in the pipeline need to be cancelled and program execution continues at a different address. This typically occurs when a branch leads to a jump instruction being executed. Optimisations to avoid stalls and flushes are out-of-order execution and branch prediction. Both have the goal of keeping the pipeline busy.
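From a software perspective, data hazards show up whenever consecutive operations form a dependency chain. A small illustrative C++ sketch (not from the thesis): summing with a single accumulator serialises the additions, while splitting the chain into independent accumulators gives the pipelined, superscalar core work it can overlap.

    #include <cstddef>
    #include <vector>

    // One accumulator: every addition depends on the previous result,
    // so the pipeline cannot overlap them.
    double sum_chained(const std::vector<double>& x) {
        double s = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i)
            s += x[i];
        return s;
    }

    // Two independent accumulators: the dependency chain is split, giving
    // the core two additions it can keep in flight simultaneously.
    double sum_unrolled(const std::vector<double>& x) {
        double s0 = 0.0, s1 = 0.0;
        std::size_t i = 0;
        for (; i + 1 < x.size(); i += 2) {
            s0 += x[i];
            s1 += x[i + 1];
        }
        if (i < x.size()) s0 += x[i];  // leftover element for odd sizes
        return s0 + s1;
    }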

3.1.2 Simultaneous Multithreading

A processor capable of Simultaneous Multithreading (SMT) can issue instructions from multiple threads of execution in a single cycle (see figure 3.6). This contrasts with Instruction-Level Parallelism (ILP) (see section 3.1.1), in which all instructions are fetched from the same thread of execution (see figure 3.5). This enhancement allows empty issue slots and bubbles caused by stalls in a superscalar processor to be filled. In contrast to ILP, a single thread of execution will never see any benefit from SMT.

SMT is implemented by duplicating the “architectural state” (data, status and control registers) [Int16]. However, most other resources, most notably execution units and caches, are shared. As a consequence, SMT can only speed up execution if none of the shared resources is a limiting factor. In practice it is often experimentally determined whether a given workload is SMT-friendly [Fog15]. SMT is marketed by Intel as Hyper-Threading (HT).

3.1.3 Multi-Core

A multi-core processor allows more efficient parallel processing from multiple threads of execution than Simultaneous Multithreading (SMT), since fewer resources are shared. Not only the “architectural state” but also the execution units and higher-level caches are fully replicated. Each replicated processing unit is referred to as a core.

Modern Intel processors use a 3-level cache hierarchy. The Level 1 (L1) and Level 2 (L2) caches are private to processor cores. In contrast, the Level 3 (L3) cache is shared by all cores. The L1 cache is often divided into a Level 1 Data Cache (L1D) and a Level 1 Instruction Cache (L1I). Such a configuration is conceptually shown in figure 3.8.


Figure 3.1: Strictly sequential processing. (Diagram: one instruction at a time passes through the stages Fetch, Decode, Execute and Write Back.)

Figure 3.2: Four-stage pipelined execution. (Diagram: the same stages, with a new instruction entering the pipeline every cycle so that several instructions are in flight simultaneously.)


Figure 3.3: Two-way superscalar and four-stage pipeline. (Diagram: two instructions enter each of the stages Fetch, Decode, Execute and Write Back per cycle.)

Figure 3.4: At t1 the turquoise operation takes two cycles to decode, which causes a stall and as a consequence bubbles appear in the pipeline. At t5, the pipeline is flushed during execution and control flow is diverted to another address. At t6 there is only one instruction available to be fetched, hence not all issue slots of the superscalar processor can be filled.


Figure 3.5: 4-way superscalar execution. All instructions are taken from a single thread of execution, which can use all space available in all caches. (Diagram: register file, L1 cache, last-level cache and a 7-stage pipeline.)

Figure 3.6: 2-way SMT and 4-way superscalar execution. Empty pipeline slots can be filled with instructions from another thread of execution. Register files are replicated whereas caches and execution units are shared.

Figure 3.7: Dual-core, 2-way SMT and 4-way superscalar execution. The Last Level Cache (LLC) is shared amongst all threads of execution.


Figure 3.8: Simplified view of an Intel Haswell-EP eight-core processor such as the one used in this project (e.g. Xeon E5-2630 v3). Eight cores contain private L1 and L2 caches. The L3 cache is divided into eight slices. Communication between cores, cache slices, the memory controller and peripherals is achieved through a double ring bus; a QuickPath Interconnect and a PCI Express I/O hub attach the processor to the rest of the system.

3.1.4 Distributed Cache

Multi-core processors greatly increase the complexity involved in cache management. In modern Intel processors the Level 3 (L3) cache is “inclusive of all lower levels of the cache hierarchy” [VJT11], meaning that all cache lines (units of data in the cache) that are stored in either the Level 1 (L1) or Level 2 (L2) cache are guaranteed to also be stored in the L3 cache.

Intel processors implement the “Modified, Exclusive, Shared, Invalid and Forward (MESIF)” cache coherence protocol. The letters of the acronym denote all possible states a cache line can be in. MESIF is equal to the MESI protocol, but adds the Forward state (a special form of the Shared state) as a performance optimisation [HP12]. The L3 cache includes a mask of “core valid” bits to indicate which “core may have a copy of that cache line” [VJT11] in its private L1 or L2 cache. If more than one core valid bit is set, it is guaranteed that the cache line is not in the modified state in any higher-level cache [Str].

If an access misses in a core’s L1 and L2 cache and in turn the L3 cache is accessed, the following scenarios are possible [Inta]:

Hit Other Core Modified The cache line is present in the L3 cache and the core valid bit is set for one other core. The other core’s cache is snooped and the cache line is found in M state. The cache line needs to be written back to the L3 cache. See figure 3.9a.

Hit Other Core Unmodified The cache line is present in the L3 cache and the core valid bit is set for one other core. The other core’s cache is snooped and the cache line is found in E or S state. Hence, it is safe to read from the L3 cache. See figure 3.9b.

Hit Snoop Miss The cache line is present in the L3 cache. The core valid bit is set for one other core. The other core’s cache is snooped but the lookup misses in the other core’s cache. Hence, it is safe to read from the L3 cache [VJT11]. See figure 3.9c.

Hit No Snoop Needed The cache line is present in the L3 cache. The core valid bit is not set, or it is set for more than one core. Hence, it is guaranteed that the cache line is not in the modified state in any other core’s L1 or L2 cache. See figure 3.9d.

Missed, Access to DRAM The cache line is not present in the L3 cache. Due to the inclusiveness property the cache line is guaranteed not to be in any other core’s L1 or L2 cache. Data needs to be transferred from DRAM. See figure 3.9e.

3.1.5 Virtual Memory and Translation Lookaside Buffer

In all modern operating systems supporting virtual memory, the mapping of virtual to physical addresses is stored in a data structure called the page table. The page table itself resides in memory. Since accessing memory for every address translation would result in an unacceptable performance penalty, caches are used to speed up the process. In modern Intel processors the following locations are searched until a translation is found (see figure 3.10, [Int15][Inta]):

1. Look up the address translation in the Data Translation Lookaside Buffer (DTLB) or Instruction Translation Lookaside Buffer (ITLB), respectively.

2. If missed, look up the address translation in the unified Second Level Translation Lookaside Buffer (STLB).

3. If missed, look up the address translation in the page table. This process is called a page walk. The page table is cached in the regular data caches:
   (a) Look up the page table in the Level 1 Data Cache (L1D).
   (b) If missed, look up the page table in the Level 2 (L2) cache.
   (c) If missed, look up the page table in the Level 3 (L3) cache.
   (d) If missed, look up the page table in memory.

Since page tables are usually not just flat data structures, page walks often include reading several memory locations, which makes them time-consuming operations.
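To put the cost in perspective, consider the reach of a TLB (the figures here are illustrative, not taken from the thesis or its test hardware): a 64-entry DTLB with 4 KiB pages covers only 64 × 4 KiB = 256 KiB of address space. A program whose working set spans hundreds of megabytes with poor locality therefore misses in the TLB on a large fraction of accesses, and each miss that also misses the STLB triggers a page walk.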


(a) Case “Hit Other Core Modified”: Access misses in core 0’s L1 and L2 cache (arrow 1, red). Since the core valid bit is set for core 1, the cache of core 1 needs to be snooped (arrow 2, brown). In core 1’s cache the cache line is found in Modified state, hence the value needs to be written back to the L3 cache (arrow 3, green). Finally, the value can be read into core 0’s L2 and L1 caches (arrow 4, blue).

(b) Case “Hit Other Core Unmodified”: Access misses in core 0’s L1 and L2 cache (arrow 1, red). Since the core valid bit is set for core 1, the cache of core 1 needs to be snooped (arrow 2, brown). In core 1’s cache the cache line is found in either Exclusive or Shared state. Hence, the value in the L3 cache is valid and can be read into core 0’s L2 and L1 caches (arrow 3, blue) without the need for writing back data.

Figure 3.9: Examples of different scenarios when accessing the shared L3 cache. Core 0 tries to read a cache line, but misses in its private L1 and L2 cache. The further behaviour is dependent on the state of the L3 cache and the other cores’ caches.


(c) Case “Hit Snoop Miss”: Access misses in core 0’s L1 and L2 cache (arrow 1, red). Since the core valid bit is set for core 1, the cache of core 1 needs to be snooped (arrow 2, brown). The snoop misses in core 1’s cache. Hence, the value in the L3 cache is valid and can be read into core 0’s L2 and L1 caches (arrow 3, blue).

(d) Case “Hit No Snoop Needed”: Access misses in core 0’s L1 and L2 cache (arrow 1, red). Since the core valid bit is set for cores 1 and 2, the cache line cannot be in modified state in either of the cores. Hence the value in the L3 cache is valid and can be read into core 0’s L2 and L1 caches (arrow 2, blue).

Figure 3.9: Continued


(e) Case “Missed, Access to DRAM”: Access misses in core 0’s L1 and L2 cache (arrow 1, red). Since it also misses in the L3 cache, the data needs to be loaded from DRAM (arrow 2, black). Then the data can be read into core 0’s L2 and L1 caches (arrow 3, blue).

Figure 3.9: Continued

3.1.6 Vector Processing

Modern processors include special execution units for vector operations. Examples include IBM’s AltiVec and Intel’s MMX instructions. The latter were subsequently enhanced and are called Advanced Vector Extensions (AVX) in their most recent form. Vector operations are a form of parallelism often referred to as data parallelism or Single Instruction, Multiple Data (SIMD). They allow operations on different data items to be carried out at once. Even though a single vector instruction often takes slightly more time to execute than a scalar instruction, the fact that vector instructions “get more work done” results in noticeable program speedups.

Compilers try to identify code sections that can be vectorised in a process called automatic vectorisation. However, there are situations when the compiler cannot automatically infer that vectorisation is safe. Therefore, the need for explicit vector programming arises. This can be done through vector intrinsics or higher-level abstractions. The use of intrinsics is discouraged since they are instruction-set dependent (e.g. code written using MMX intrinsics cannot be efficiently run on a processor supporting AVX). Higher-level mechanisms for explicit vector programming such as the ones available in Open Multi-Processing (OpenMP) and Cilk Plus abstract those differences and enable performance portability.
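As an example of such a higher-level mechanism, OpenMP’s simd construct lets the programmer assert that loop iterations are independent without tying the code to a particular instruction set (a minimal sketch; assumes a compiler flag such as -fopenmp-simd):

    #include <cstddef>
    #include <vector>

    // The pragma asserts that the iterations do not depend on each other, so
    // the compiler may emit SSE or AVX instructions of whatever width the
    // target supports, keeping the source performance-portable.
    void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
        #pragma omp simd
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = a * x[i] + y[i];
    }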

Figure 3.10: Simplified view of the Intel Haswell pipeline and caches. Execution ports 2, 3, 4 and 7 can be used for memory operations. Address translations are cached in the Data Translation Lookaside Buffer (DTLB)/Instruction Translation Lookaside Buffer (ITLB) and the Second Level Translation Lookaside Buffer (STLB). If a lookup misses in the STLB a page walk is started. Its memory accesses are cached in the Level 1 Data Cache (L1D), Level 2 (L2) and shared Level 3 (L3) caches. (Diagram: fetch, decode, branch prediction, write back, the reservation station, load/store units and other execution units (IU, FPU, etc.) are replicated per core; the L3 cache belongs to the uncore, shared amongst cores.) Source: [Int15]

3.1.7 Accelerators, Coprocessors and Many-Core

Traditionally, specialised hardware or Field Programmable Gate Arrays (FPGAs) were used as accelerators alongside general-purpose processors. With the introduction of NVIDIA’s Compute Unified Device Architecture (CUDA) in 2007, the entry barrier for using accelerators was significantly lowered. To use CUDA, no detailed knowledge of hardware development or Graphics Processing Unit (GPU) architecture is required. Today the usage of GPUs as accelerators is restricted neither to CUDA as a programming interface nor to NVIDIA as a GPU vendor. GPUs are particularly well suited for algorithms which contain a lot of data parallelism (see section 3.1.6), since the cost of diverging branches is high.

More recently, Intel entered the market of coprocessors with its Many Integrated Core (MIC) architecture (marketed as Xeon Phi). The key features are many simple cores and wide vector units. This makes MIC particularly suitable for algorithms with a high degree of data parallelism. The first-generation MIC was available only as a coprocessor connected through PCI Express. In contrast, the second-generation MIC is additionally available as a standalone host processor, which can run any operating system. This step required significant changes to the processor design: the second-generation MIC processor is binary compatible with regular Xeons and has increased scalar performance [Sod+16].


3.1.8 Dynamic Voltage and Frequency Scaling

Dynamic Voltage and Frequency Scaling (DVFS) is used in modern processors to improve energy efficiency. Intel’s implementation is known as Turbo Boost. This technology allows the processor to run at a higher frequency than its nominal operating frequency “while ensuring that it does not exceed its electrical and thermal specifications” [Intb]. The maximum allowed frequency is dynamically determined by taking into account the number of active cores, actual current consumption, power dissipation, and temperature. How much above the nominal operating frequency a specific processor can run is model-dependent.

Turbo Boost can have a significant impact on the execution time. This is especially true when comparing sequential and parallel execution of a program, since the sequential execution can benefit from a higher core frequency. It is therefore often recommended to disable Turbo Boost during the performance optimisation process to obtain more consistent runtime measurements and a better picture of the true scalability of a program [BCS12][Man15].

3.2 Parallel Programming

Task-oriented programming has proven to be an adequate abstraction for efficiently using parallel compute resources and is discussed in section 3.2.1. Similarly, locality of reference, which is introduced in section 3.2.2, is an important concept for making efficient use of caches.

3.2.1 Task-Oriented Programming Models

Experience from parallel programming over the past years has shown that programming directly towards threading Application Programming Interfaces (APIs) often is problematic [Lee06]. It has been generally acknowledged that task-oriented programming models overcome those issues. Task-oriented programming models have existed for many years, but interest in those technologies, which abstract parallelism from the underlying execution mechanisms, seems to be growing, especially as the need for nested parallelism (see figure 3.11) and performance portability rises. Both are hard to achieve by directly programming towards a threading API [MRR12]. One of the main problems with threading APIs is that they enforce parallel execution. As an example, imagine an 8-core processor on which only four threads are running. If one of the threads encounters a nested 3-way parallel region, the nested region should be executed in parallel in order to increase the system utilisation. However, if the same workload is running on a quad-core processor, parallel execution of the nested region will inevitably lead to oversubscription of the processor.

In task-oriented programming, “the programmer takes on the burden of identifying what can be computed safely in parallel, leaving the decision of exactly how the division will take place to the runtime system” [MKH91]. The how part includes scheduling, mapping to hardware, load balancing, etc. Implementations of the model have existed for a very long time [Blu+95]. Scheduling is usually implemented through some sort of work stealing. One benefit of work stealing, as opposed to having a centralised work queue, is the reduced need for synchronisation.

Popular technologies supporting the task-oriented model include Open Multi-Processing (OpenMP), Cilk Plus and Intel Threading Building Blocks (TBB).
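A minimal TBB sketch of the nested scenario described above (process is a hypothetical per-item function): both loops are expressed as tasks, and the work-stealing scheduler decides how many iterations actually run in parallel, so the same code neither leaves an 8-core machine underutilised nor oversubscribes a quad-core one.

    #include <tbb/parallel_for.h>

    void process(int i, int j) { /* hypothetical per-item work */ }

    void run_nested(int outer, int inner) {
        // Outer and inner iterations become tasks; TBB maps them onto a
        // fixed pool of worker threads instead of spawning new threads.
        tbb::parallel_for(0, outer, [=](int i) {
            tbb::parallel_for(0, inner, [=](int j) {
                process(i, j);
            });
        });
    }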


Figure 3.11: (a) Parallelism; (b) Nested Parallelism. In nested parallelism, parallel regions themselves contain another level of parallel regions.

3.2.2 Locality of Reference

The caches introduced in section 3.1 can only benefit a program’s execution if it exhibits a sufficient degree of locality of reference. Two main types of locality are distinguished: temporal locality means that recently used data is likely to be used again, whereas spatial locality means that neighbouring data is likely to be accessed soon. This is conceptually illustrated in figure 3.12. In order to exploit spatial locality, caches are organised in terms of cache lines. Cache lines (typically 64 bytes) are the units of consecutive data which get transferred into and stored in caches. Even if an instruction is operating only on a single byte, a complete cache line is transferred into the cache. The advantage is that if a later instruction accesses neighbouring data falling into the same cache line, then no transfer is needed. Similarly, processors try to maximise the effect of spatial locality by prefetching cache lines if they can detect an ascending or descending access pattern.

Both spatial and temporal locality apply to data and code. Data locality is mostly influenced by the algorithms and data structures used. Code locality depends on the number and distance of jump instructions. Compilers have functionality that can help minimise jumps.
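A classic illustration of data locality (an illustrative sketch, not SIMLOX code): traversing a row-major matrix in row order touches consecutive addresses, so every fetched cache line is fully used, whereas column order strides through memory and wastes most of each line.

    #include <vector>

    // Sum a row-major n×n matrix stored in one contiguous vector.
    double sum_row_major(const std::vector<double>& a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)          // good: consecutive addresses,
            for (int j = 0; j < n; ++j)      // each cache line fully used
                s += a[i * n + j];
        return s;
    }

    double sum_column_major(const std::vector<double>& a, int n) {
        double s = 0.0;
        for (int j = 0; j < n; ++j)          // poor: stride of n doubles,
            for (int i = 0; i < n; ++i)      // one element used per line fetched
                s += a[i * n + j];
        return s;
    }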

If a program lacks locality of reference, cached data is evicted before it can potentially be used. If caches are shared, as in Simultaneous Multithreading (SMT) (see section 3.1.2) or multi-core (see section 3.1.3), even cached data belonging to another thread of execution might be evicted.


Figure 3.12: Conceptual illustration of locality of reference. The graph shows memory addresses accessed over time (axes: time, memory address accessed; annotated regions: temporal locality, spatial locality). Temporal locality is when the same addresses are accessed in a short time interval. Spatial locality is when addresses are accessed in ascending or descending fashion.


Chapter 4

Related Work

Intel Threading Building Blocks (TBB) is a collection of components to support parallel programming using C++ on shared-memory machines. TBB started as a commercial product from Intel in 2006 and hence predates the C++11 threading facilities. TBB includes, amongst other components, a task scheduler based on work stealing, algorithmic skeletons which make use of tasks, synchronisation primitives, containers and memory allocators. The scheduler is highly influenced by the Cilk Plus scheduler, which was presented in [Blu+95]. TBB is implemented as a C++ template library and therefore works with any C++ compiler, in contrast to Cilk Plus, which is an extension for C/C++/Fortran compilers. The downside of the template approach is that some features available in Cilk Plus, such as explicit vector programming, are not available in TBB. However, Cilk Plus and TBB can be combined in order to get the best of both worlds [MRR12]. Similarly, TBB components such as the memory allocators can be used independently and in conjunction with other technologies. The internals of both TBB’s work-stealing task scheduler and TBB’s memory allocator are explained in [KV07]. The two main concepts of the allocator are thread-private heaps to reduce locking, and segregation (grouping objects by size). Segregation “provides better locality for similarly-sized objects that are often used together” [KV07]. The maximum allowed difference in size is 25%, which according to the authors prevents excessive internal fragmentation.
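As an example of using a TBB component in isolation, the scalable memory allocator can replace the default allocator of an STL container (a minimal sketch; the program must be linked against the tbbmalloc library):

    #include <tbb/scalable_allocator.h>
    #include <vector>

    int main() {
        // Allocation requests are served from thread-private heaps instead
        // of one global heap, avoiding lock contention when many threads
        // allocate simultaneously.
        std::vector<double, tbb::scalable_allocator<double>> samples;
        samples.reserve(1024);
        for (int i = 0; i < 1024; ++i)
            samples.push_back(i * 0.5);
    }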

Akaroa by [PYM94] is an often-cited implementation of a Multiple Replications in Parallel (MRIP) based simulator. Version 2 was presented in [EPM99]. In this publication the authors conclude that “the MRIP scenario can achieve speedup equal to the number of processors used”. The paper also describes how the “initialization-bias problem” is mitigated and how pseudo-random numbers are distributed. [Haq+11] analyses the performance impact of running Akaroa2 on PlanetLab, a network with nodes deployed around the world. As expected, the performance decreases when the communication latency is increased. But the authors conclude that “distributed computing resources [...] can be effectively used for running quantitative stochastic simulations in MRIP scenario”.

“Warp-Level Parallelism” is proposed by [Pas+11] to implement an MRIP-based simulator using Compute Unified Device Architecture (CUDA). The idea behind this approach is to only execute one thread per graphics card warp. This way branch divergence does not negatively affect the runtime. However, this comes at the cost of poor utilisation of the graphics card.


The High-Level Architecture for Modelling and Simulation (HLA) is an IEEE standard for creating reusable and interoperable simulation components. The execution of multiple simulation components, called federates in HLA jargon, is controlled by a Runtime Infrastructure (RTI) software. The HLA standard only provides the specification for an RTI software; consequently, a number of commercial and non-commercial implementations exist. HLA implements Single Replication in Parallel through Model Decomposition (SRIP.M) and supports both conservative and optimistic execution of the simulation [Fuj98].

A conservative simulator based on Cilk which makes use of work stealing is presented in [CLT97].

The “GPU-based discrete-event simulation kernel” introduced by [TY13] consists of three algorithms which enable an efficient execution of conservative SRIP.M simulations on CUDA Graphics Processing Units (GPUs). The “breadth-expansion conservative time window” algorithm is used to find events which are safe to be processed. The “event redistribution” algorithm ensures that events of the same type get executed within neighbouring warps to minimise the impact of branch divergence. The third algorithm is concerned with memory management.

A conservative simulation algorithm used to cycle-accurately simulate a 64-core processor on a 16-core host machine is discussed by [Lv+10]. They achieve a good speedup, thanks in part to the parallelism naturally present in their simulation target.

How the performance of an optimistic simulator can be improved by exploiting the low communication latencies on a shared-memory multi-core processor, compared to a previous Message Passing Interface (MPI)-based distributed memory solution, is described by [JAGP12]. Their implementation is based on Portable Operating System Interface (POSIX) threads.

The use of Dynamic Voltage and Frequency Scaling (DVFS) to minimise the number of rollbacks in an optimistic simulator is investigated by [CW12]. They “increase the frequency of the cores executing threads on the critical path (those experiencing infrequent rollback) and decrease the frequency of the cores executing threads off the critical path (those experiencing excessive rollback)”. They note that for an efficient application of their method, it must be possible to set a core’s frequency above its base frequency from software. For their work they used an experimental many-core processor from Intel which had this functionality. However, this is usually not available on standard processors.

The spray list, a “scalable relaxed priority queue” inspired by skip lists, is suggested by [Ali+15]. Instead of letting parallel dequeue operations compete for the highest-priority element, each thread randomly skips a few elements in the queue, such that operations take place on distinct elements. In an experiment that resembles an optimistic SRIP.M workload, the authors show that the increased scalability of the data structure can lead to higher efficiency, even though some rollbacks occur due to causality errors introduced by picking a lower-priority element.

One of the biggest challenges when implementing optimistic SRIP.M is the automatic implementation of rollback methods. Historically, this was achieved through state-saving. [LGC14] presents an alternative approach based on reverse computing, which allows rollback methods to be automatically generated. However, their method has to revert to state-saving on some occasions.

A source-to-source code transformation tool for the C++ language that can be used to automatically generate reverse code for event procedures used in an optimistic SRIP.M simulator is presented by [Sch+15]. The original code is instrumented and keeps track of all the memory locations modified by event procedures. The authors admit that their approach results in a runtime and memory overhead, but highlight that their method is completely automated.


Chapter 5

Optimising Scalability

This chapter starts by introducing the most important details about the implementation of SIMLOX in section 5.1. Then the optimisation process, test cases and test hardware used in this project are presented in section 5.2. Section 5.3 explains profiling and microarchitecture analysis and how the VTune software works. Sections 5.4 and 5.5 track down the root causes of the scalability issues. Finally, sections 5.6 and 5.7 go into the optimisations carried out in this project and their details.

5.1 SIMLOX Implementation

SIMLOX is a dynamic, probabilistic, event-driven simulation software which at its core follows the simple algorithm introduced in section 2.2. The implementation is done in C++ and development started in the mid ’90s. Hence, parts of the implementation originate from source code which is older than the first C++ standard. Having such a long heritage explains why some features are not implemented the way one would implement them today if starting development from scratch. Currently SIMLOX can only be run on the Windows platform, mainly because of its dependency on the Microsoft Foundation Class Library (MFC). To give a rough feeling of the extent of SIMLOX’s code base, the number of lines of code is shown in table 5.1.

Table 5.1: Approximate Lines of Code in SIMLOX

                       Lines of Code (excluding Comments)
w/o User Interface     125'000
with User Interface    420'000

5.1.1 Result Collection

Rather than storing every event that occurred during simulation in its output, SIMLOX maps all events into discrete-time “result collection intervals” and only stores the aggregated results per interval. The left-closed, right-open intervals [t1, t2) have variable length and are non-overlapping. This is conceptually illustrated in figure 5.1.
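As a minimal sketch of this mapping (all names are hypothetical; SIMLOX’s actual code differs), a process spanning [a, b) can be matched against sorted, non-overlapping intervals by locating the first candidate with a binary search and then walking forward while intervals still overlap:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Interval { double t1, t2; };  // left-closed, right-open: [t1, t2)

    // Return the indices of all result collection intervals that a process
    // spanning [a, b) overlaps. Assumes 'intervals' is sorted and non-overlapping.
    std::vector<std::size_t> overlapping(const std::vector<Interval>& intervals,
                                         double a, double b) {
        std::vector<std::size_t> hit;
        // First interval whose right end lies beyond the process start.
        auto it = std::upper_bound(intervals.begin(), intervals.end(), a,
                                   [](double v, const Interval& iv) { return v < iv.t2; });
        for (; it != intervals.end() && it->t1 < b; ++it)
            hit.push_back(static_cast<std::size_t>(it - intervals.begin()));
        return hit;
    }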


The default setting in SIMLOX is to have 24-hour result collection intervals over the whole simulation period. However, most of the cases analysed in this project define varying interval lengths. Simulation results in SIMLOX usually look like the screenshot in figure 5.2.


Figure 5.1: Part of the result collection is to find into which intervals a given process (e.g. a mission, maintenance task) falls. Intervals have variable length, are half-closed and non-overlapping.

Figure 5.2: Typical result view in SIMLOX. The graph shows the states a number of systems are in over time.

5.1.2 Parallelism in SIMLOX

SIMLOX implements the Multiple Replications in Parallel (MRIP) strategy (see section 2.3.1). C++11 threading facilities are used for the implementation. Since they lack a built-in task abstraction (see section 3.2.1), SIMLOX contains a custom solution for task-oriented programming. Replications are modelled as tasks, which get scheduled to worker threads through a centralised, lock-based work queue. The overhead resulting from the lock is expected to be low, since the runtime of a replication is rather long (typically seconds to minutes). Therefore, the lock is acquired only infrequently and for a short amount of time.
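A minimal sketch of such a scheme (structure and names are illustrative assumptions, not SIMLOX’s actual implementation): replication tasks are queued behind one mutex and pulled by C++11 worker threads. Because each task runs for seconds to minutes, contention on the single mutex stays negligible.

    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Centralised, lock-based work queue; each queued task is one replication.
    class WorkQueue {
    public:
        void push(std::function<void()> task) {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tasks.push(std::move(task));
        }

        bool try_pop(std::function<void()>& task) {
            std::lock_guard<std::mutex> lock(m_mutex);  // held only to pop
            if (m_tasks.empty()) return false;
            task = std::move(m_tasks.front());
            m_tasks.pop();
            return true;
        }

    private:
        std::mutex m_mutex;
        std::queue<std::function<void()>> m_tasks;
    };

    void run_replications(WorkQueue& queue, unsigned workers) {
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < workers; ++i)
            pool.emplace_back([&queue] {
                std::function<void()> task;
                while (queue.try_pop(task)) task();  // run tasks until queue is empty
            });
        for (auto& t : pool) t.join();
    }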

The execution of a simulation in SIMLOX can be broken down into three phases as shown in table 5.2 and figure 5.3. One particularity is that results of those replications that get scheduled to the same worker thread are aggregated by the workers themselves during parallel execution. Hence, during the final sequential phase s2 only the per-worker results need to be aggregated. The runtime of s2 therefore grows with O(Worker Threads) and not O(Replications).

Table 5.2: The three phases of a simulation run in SIMLOX.

Phase   Execution    Description
s1      Sequential   Read input file and create in-memory representation.
rn      Parallel     Execute n replications of the discrete-event algorithm in parallel.
s2      Sequential   Aggregate per-worker results and write result file.


Figure 5.3: The three phases of a simulation run in SIMLOX. In this example eight replications are scheduled over four workers.

5.2 Iterative Optimisation Process

To understand and mitigate the scalability problems in SIMLOX, an iterative approach to performance tuning was chosen for this project (see for example [Mar14][Man15]). The process is briefly described in table 5.3 and figure 5.4.

In this project, correctness of the optimisations was ensured by verifying that the software still yields the same results for all test cases. This guarantees the validity of the performance evaluation. For inclusion in the next stable release, more testing will be required; however, as stated in section 1.3, this is outside the scope of this thesis.


Table 5.3: Phases in the iterative optimisation process

Phase                  Description
Select test case       Choose input data sufficiently large to expose the performance issue.
Measure scalability    Execute the program with varying thread counts and measure runtime to identify at which point the scalability bottleneck is encountered.
Instrument and run     Gather information on program execution, typically through the use of a profiler.
Identify bottlenecks   Analyse profiler output and identify the root cause of the scalability bottleneck.
Optimise               Change the program to mitigate the scalability bottleneck.
Validate               Validate correctness and performance improvement of the change.


Figure 5.4: Phases in the iterative optimisation process


5.2.1 Test Cases

In SIMLOX different cases can activate different event routines. For example, a particular case might focus on simulating maintenance personnel availability, while a different case might not contain any information on maintenance personnel at all. This will lead to totally different code sections being executed, and consequently it is possible that completely different scalability bottlenecks can be observed.

Similarly, comparing the size of different SIMLOX simulation cases is hardly possible due to the vast number of input dimensions. The simulation period, amount of events processed, granularity of result collection intervals, number of systems and components, complexity of the support organisation and the number of missions are only a few of the many factors affecting the simulation runtime. As a consequence, it is often impossible to predict the runtime behaviour given any of those parameters. In practice the runtime itself is often used as a metric to describe the size of a simulation case.

Table 5.4 contains an overview of the test cases used in this project. All the test cases were obtained from Systecon’s consulting unit and are therefore representative of what SIMLOX end users encounter in practice.

Table 5.4: Test cases used in this project. Baseline runtime is given for sequentially executing eight replications. Baseline efficiency is shown for 8-way parallel execution.

Name            Baseline Runtime   Baseline Efficiency   Comment
Subway Depot    ~3.5 min           0.94                  Has almost linear scalability in the baseline version. Used to ensure that none of the changes worsens scalability.
Light Rail      ~3 min             0.60                  Has the shortest runtime. Most profiling is therefore done on this case.
Subway Line A   ~18 min            0.14                  Baseline parallel efficiency is poor.
Subway Line B   ~13 min            0.31                  Similar to case Subway Line A.
Rolling Stock   ~80 min            0.81                  Has the longest runtime.
Wind Farm       ~30 min            0.53                  Has very short result collection intervals.

5.2.2 Test Hardware

The technical specification of the test system used in this project is given in table 5.5.

Table 5.5: Specification of hardware used in this project

Number of Processors                 1
Processor                            Intel Xeon E5-2630 v3
Microarchitecture                    Intel Haswell-EP
Pipeline                             14-19 stages
Superscalar Execution                4-way superscalar
Simultaneous Multithreading          2-way SMT
Core Count                           8 cores (see figure 3.8)
Base Frequency                       2.4 GHz
Max. Turbo Boost Frequency [Mic16]   2.6 GHz with 8 active cores
                                     2.9 GHz with 4 active cores
                                     3.2 GHz with 1 or 2 active cores
Level 1 Instruction Cache            32 kilobytes
Level 1 Data Cache                   32 kilobytes
Level 2 Cache                        256 kilobytes
Level 3 Cache                        20 megabytes (shared)
Data TLB                             32 entries
Instruction TLB                      64 entries
Second Level TLB                     512 entries
DRAM                                 64 gigabytes

5.3 Profiling and Microarchitecture Analysis

Gathering runtime information about a program is referred to as profiling. Profiling is used to find out how a program uses compute resources such as CPU time, memory, etc. The goal of profiling is to identify hotspots: sections of the program that consume a significant amount of resources and therefore are candidates for optimisation. Profilers can be classified into two categories: instrumentation and sampling.

For instrumentation, the profiler inserts bookkeeping functionality into the source code or binary, for example whenever a method is entered and exited. This overhead can drastically increase the program runtime. Hence, it is often infeasible to run large problem instances with an instrumented binary.

A sampling profiler does not modify the executable, but instead interrupts execution at periodic intervals to record the program state. Statistical analysis is then performed based on the samples taken. As a consequence, the results from sampling are less accurate than those from instrumentation, but the overhead is lower.

Through profiling, both algorithmic and hardware issues can be identified. The latter is also referred to as microarchitecture analysis. It is concerned with finding out where issues such as pipeline stalls and flushes originate from (see section 3.1.1). Such low-level information is collected through hardware performance counters (registers built into the processor). Having dedicated hardware allows this information to be gathered with a minimal performance overhead.

In this project Intel VTune Amplifier (short: VTune) is used. It is a low-overhead sampling profiler which abstracts away the programming of hardware performance counters and differences between processor generations from the user. The overhead in runtime is typically only a few percent, which makes it possible to analyse the execution of long-running programs. The output of VTune is presented in a highly graphical way for easier interpretation.


5.4 Performance Pitfalls in Parallel Programming

[MRR12] gives an overview of the most common pitfalls in parallel programming based on many years of experience. In this section those pitfalls related to performance are presented and it is discussed whether they potentially affect SIMLOX. This discussion lays the ground for the in-depth analysis that follows in the upcoming sections.

Lock Granularity  Locks are low-level constructs for synchronisation and to prevent races. Inappropriate lock granularity can lead to severe performance issues. In case of excessively fine-grained locking, the cost of lock overhead becomes unacceptable, i.e. much time is spent acquiring and releasing locks. On the other hand, if a coarse-grained locking strategy is employed, threads might have to wait a significant amount of time until locks are released. Similar issues can occur with atomic operations. The SIMLOX software uses a work queue guarded by a lock, but the lock is acquired only infrequently and for a short amount of time (see section 5.1.2). Therefore, lock and synchronisation related issues were initially not assumed to be responsible for the poor scalability observed, an assumption which later had to be revised (see sections 5.6.2 and 5.6.3).

Lack of Locality  Locality of reference is explained in section 3.2.2. If a program exhibits a low degree of locality, the communication overhead increases, which can severely degrade performance. In SIMLOX many algorithms iterate over lists of pointers to objects allocated on the heap. This indirection leads to non-uniform memory access patterns, since there is no guarantee that objects get placed adjacently on the heap, even if they are allocated consecutively. Lack of spatial locality is therefore a probable source of the scalability problems observed (a generic code sketch of this difference follows after this list).

Load Imbalance  Significantly different runtimes of workers inside a parallel region can lead to poor usage of parallel compute resources, since the program execution time and speedup are limited by the longest-executing worker, as Amdahl’s law explains. Various reasons can lead to load imbalance issues: a parallelisation strategy leading to unpredictable runtimes, a poor load balancing strategy, or differences in hardware performance. SIMLOX’s Multiple Replications in Parallel (MRIP) parallelisation strategy (see sections 2.3.1 and 5.1.2) could potentially lead to load imbalance issues, since the time required to execute a replication depends on the random numbers drawn. Fortunately, the impact on the runtime is low in practice, as figure 5.5 shows.

Overhead  Launching workers, scheduling tasks, synchronising, aggregating results etc. during parallel execution adds overhead to the program. The overhead needs to be properly balanced such that it can be amortised by the parallel execution. [MRR12] suggests using tree-based schemes to minimise overhead. SIMLOX at first glance does not seem to suffer from excessive overhead.
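As a generic illustration of the locality pitfall above (not SIMLOX code), compare iterating heap-allocated objects through pointers with iterating a contiguous vector of objects:

    #include <memory>
    #include <vector>

    struct Event { double time; int type; };

    // Poor spatial locality: each element is a separate heap allocation, so
    // consecutive iterations may touch widely scattered cache lines and pages.
    double sum_indirect(const std::vector<std::unique_ptr<Event>>& events) {
        double s = 0.0;
        for (const auto& e : events) s += e->time;  // pointer chase per element
        return s;
    }

    // Good spatial locality: elements are stored contiguously, so each cache
    // line holds several events and the hardware prefetcher can stream them.
    double sum_contiguous(const std::vector<Event>& events) {
        double s = 0.0;
        for (const Event& e : events) s += e.time;
        return s;
    }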


5.5 Symptoms of Scalability Issue in SIMLOX

From figure 5.5 the question arises why individual replications take significantly more time to execute in parallel execution compared to sequential execution. Interestingly, the Cycles per Instruction Retired (CPI) metric, which measures the efficiency of program execution, seemed to correspond to this observation, as table 5.6 shows. CPI is, as the name suggests, defined as follows:

\[
\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instructions Retired}}
\]

On a 4-way superscalar processor (see section 3.1.1), such as the one in the test system used in this project (see section 5.2.2), the best possible CPI is 0.25. Data hazards, stalls, mispredicted branches etc. can lead to a higher CPI, which indicates a less efficient execution.

5.6 Optimisations

Sections 5.6.1 to 5.7.2 detail the optimisations implemented as part of the iterative process described in section 5.2.

5.6.1 Choosing a Compiler and Compiler Options

Different compilers and compiler options can under some circumstances have a significant impact on the application runtime. The best-suited compiler should repeatedly be re-evaluated [Man15]. SIMLOX was in the past developed with Visual Studio and the bundled Visual C++ compiler. In this project the performance of the compilers listed in table 5.7 was evaluated.

Due to the usage of C++11 threading facilities in SIMLOX (see section 5.1.2), the /Qstd=c++11 flag is required to compile SIMLOX. However, older parts of the source code in SIMLOX use for-loops which do not conform to the standard and require the /Zc:forScope- option in order to compile. On the Intel C++ compiler combining those two options is illegal. Hence, all non-conforming loops had to be rewritten in order to successfully compile with the Intel C++ compiler. Similarly, there was a bug in the source code which prevented the Visual C++ compiler from using auto-inlining. This bug did not affect the Intel C++ compiler.
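The non-conforming loops stem from the old Visual C++ for-scope extension, under which the loop variable remains visible after the loop (accepted only with /Zc:forScope-). A hypothetical example of the pattern and a conforming rewrite:

    #include <cstddef>
    #include <vector>

    // Non-conforming (legacy Visual C++ for-scope extension):
    //
    //   for (int i = 0; i < n; ++i)
    //       if (data[i] == key) break;
    //   if (i < n) { /* 'i' is out of scope here in standard C++ */ }
    //
    // Conforming rewrite: hoist the variable out of the for-init statement.
    int find_index(const std::vector<int>& data, int key) {
        std::size_t i = 0;
        for (; i < data.size(); ++i)
            if (data[i] == key) break;
        return i < data.size() ? static_cast<int>(i) : -1;
    }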

With regard to compiler flags, the best performance was achieved when enabling maximum optimisation (/Ox) and interprocedural optimisation (/GL and /Qipo, respectively). Other optimisation flags (see for example section 5.7.1) did not result in an immediate performance benefit and were therefore disabled.

With those options enabled, the two compilers achieved essentially the same performance (see chapter 6). Therefore, the decision was made to use the Intel C++ compiler in this project.

Figure 5.5: Histogram showing runtimes of individual replications of test case Light Rail in sequential (µ = 21.870 s, σ² = 0.106) and parallel execution (µ = 26.601 s, σ² = 0.443). The variance is relatively low, but the difference between the two averages of more than 4.7 s hurts scalability. Turbo Boost (see section 3.1.8) is disabled. The same phenomenon was observed with other test cases.

Table 5.6: Runtime and CPI in sequential and parallel execution for test case Light Rail. The same phenomenon was observed with other test cases.

                                        Sequential     8-way Parallel
                                        Execution      Execution
Runtime per Replication (s)             µ = 21.870     µ = 26.601
                                        σ² = 0.106     σ² = 0.443
Cycles per Instruction Retired (CPI)    0.658          0.894

Table 5.7: Overview of the compilers evaluated

Compiler               Version   Part of
Microsoft Visual C++   18        Visual C++ 2013
Intel C++ Compiler     16        Parallel Studio 2016 XE

5.6.2 Using a Scalable Memory Allocator

SIMLOX relies heavily on dynamic memory allocation. Commonly, between 10 and 15% of the execution time is spent inside memory allocation and deallocation functions. Memory allocations occur either in the form of direct calls to the new operator or by using library classes that follow the Resource Acquisition Is Initialisation (RAII) idiom, such as std::vector.

It is therefore crucial that those calls execute in parallel; otherwise the scalability of the program can severely degrade due to Amdahl’s law. Many frequently used memory allocators were optimised for “efficient use of memory space and minimisation of CPU overhead” [KV07]. Some of them make use of locking to guarantee thread safety, and hence serialise object allocation.

Intel Threading Building Blocks (TBB) provides a memory allocator designed for multithreaded programs. TBB’s memory allocator can be used independently of the threading library used. TBB provides a Dynamic Link Library (DLL) to replace all memory operations in an application with its own implementation [Inta]. This has the advantage that no source code needs to be modified. It is sufficient to link to the tbbmalloc_proxy.dll. It requires, however, that the Visual C++ runtime library is used as a DLL rather than linked statically. SIMLOX previously relied on the latter method.
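One way to opt in without touching any allocation call site, following the proxy mechanism described in [Inta] (a sketch; the exact project settings depend on the TBB version):

    // Including this header in any one translation unit (and linking against
    // the tbbmalloc proxy library) redirects malloc/free and new/delete for
    // the whole process to the scalable TBB allocator at load time.
    #include <tbb/tbbmalloc_proxy.h>

    #include <vector>

    int main() {
        // These allocations, including std::vector's internal ones, now go
        // through the TBB allocator instead of the default CRT heap.
        std::vector<int> v(1000000, 42);
        int* p = new int(7);
        delete p;
        return v.back() == 42 ? 0 : 1;
    }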

Since some SIMLOX simulation cases consume a significant amount of memory, the question arises how much memory consumption increases when using the scalable TBB allocator. The general answer is that only an increase of a few percent can be observed. The detailed results on a per-test-case basis are shown in section 6.2.

5.6.3 Eliminating Unintended Write Sharing

When performing a replication, the worker threads are supposed to only access the simulation input data structure in a read-only way (see section 5.1.2). Therefore, it came as a surprise that VTune reported heavy write sharing taking place between the threads. Further analysis revealed that it was the CString class of the Microsoft Foundation Class Library (MFC) which caused the write sharing.

The CString class implements Copy-on-Write (COW) based on reference counting. Whenever CString’s assignment operator is called, for example when iterating a list of CStrings, the reference count is modified through the use of atomic operations. On a multi-core processor this leads to heavy cache coherence traffic (see section 3.1.4).

Beginning with C++11, implementations of std::string are no longer allowed to be based on COW. Hence, std::string seems to be more appropriate for usage in SimInput. However, since the SIMLOX implementation is heavily dependent on CString’s API, it was not feasible to simply eliminate the usage of CString. Instead, an “adapter” class was written which implements the subset of the CString API used in SIMLOX and delegates all calls to a std::string instance encapsulated in the adapter object. Then, all definitions of CStrings in the SIMLOX source code were replaced with the new adapter class. Since inlining is enabled (see section 5.6.1), the performance penalty due to the adapter class should be very low.
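A minimal sketch of such an adapter (the class name is hypothetical and only a few representative CString-style methods are shown):

    #include <string>

    // Expose the subset of the CString API used by the code base, but store a
    // non-COW std::string internally, so assignments copy the characters
    // instead of touching a shared, atomically updated reference count.
    class StringAdapter {
    public:
        StringAdapter() = default;
        StringAdapter(const char* s) : m_str(s) {}

        int GetLength() const { return static_cast<int>(m_str.size()); }
        bool IsEmpty() const { return m_str.empty(); }
        const char* GetString() const { return m_str.c_str(); }

        StringAdapter& operator=(const StringAdapter&) = default;  // plain copy, no COW

    private:
        std::string m_str;  // value semantics; no sharing between threads
    };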

Figure 5.6 illustrates the impact of changing the string implementation and shows that the unintended write sharing was eliminated by using a non-COW-based string implementation.

The memory footprint was analysed to ensure that the usage of a non-COW string does not result in significantly higher memory usage. Again, no significant increase could be observed. The detailed results on a per-test-case basis are shown in section 6.2.

[Figure: bar chart of L3 cache accesses (Hit, Other Core Modified; Hit, Other Core Unmodified; Hit, No Snoop Needed; Hit, Snoop Miss; Missed, Access to DRAM) for 1 thread with CString, 8 threads with CString, and 8 threads with std::string.]

Figure 5.6: Chart shows the usage of the L3 cache in different scenarios. The legend entries are explained in detail in section 3.1.4. When running only one thread, it is by definition not possible to find a cache line in another core’s cache. When running in parallel with CString, cache lines are often found in other cores, sometimes even in modified state. Changing to std::string essentially eliminates this unintended data sharing.

Another question that arises is how big the performance penalty from using string operations in SIMLOX is in the first place. One is tempted to think that string operations, which are much more expensive than their integer counterparts, present a major inefficiency. However, profiling in VTune shows that after the switch to std::string, they do not consume a significant amount of time anymore.

5.6.4 Using Profile-Guided Optimisation

At this point VTune reported a high miss rate in the Translation Lookaside Buffer (TLB) (see section 3.1.5). High TLB overhead is usually a sign of poor locality (see section 3.2.2) and can be improved by reducing the working-set size (see section 5.6.5) and by using Profile-Guided Optimisation (PGO).

PGO is a compiler technique which helps “reorganising code layout to reduce instruction-cache problems, shrinking code size, and reducing branch mispredictions” [Intc]. PGO improves locality of reference in the code layout by placing code sections frequently used together near each other. This has the effect that fewer memory pages are touched during execution and reduces the number of entries required in the Instruction Translation Lookaside Buffer (ITLB).

For PGO the compiler requires information not available at compile time, such as function execution counts and branches taken. Hence, PGO consists of three steps to feed this information back to the compiler [Intc]: generate an instrumented binary, run the instrumented binary to gather runtime information, and generate the final binary taking the gathered runtime information into account.

The fact that different simulation cases potentially take completely different paths of execution (see section 5.2.1) presents a major problem for PGO. It is therefore important to execute the profiling step with sufficiently many cases such that enough code gets covered. Failing to do so resulted in a performance penalty of up to 50% in this project.


[Figure: log-scale counts of where address translations hit (in the STLB, L1, L2, L3, or RAM) for one thread, eight threads, eight threads with improved locality, and eight threads with interval_map.]

Figure 5.7: Diagram shows in which cache address translations hit. When running eight threads instead of a single thread, more lookups have to resort to memory due to the shared L3 cache. Improving locality helped reduce the number of lookups slightly, but there were still more memory lookups than in the sequential execution. When using a boost::icl::interval_map the number of accesses resorting all the way to memory became essentially zero.

5.6.5 Eliminating Translation Lookaside Buffer Thrashing

The root cause of poor Translation Lookaside Buffer (TLB) usage persisted after using Profile-Guided Optimisation (PGO) as described in section 5.6.4. Further analysis showed that when running multiple threads, more translations miss in all levels of the cache hierarchy (see section 3.1.5) and need to resort to memory. The increase is caused by the threads evicting page table entries used by other threads from the shared Level 3 (L3) cache (see figure 5.7).

Fortunately, most Data Translation Lookaside Buffer (DTLB) misses originated from a single code section, namely the part of the result collection that searches all intervals that overlap with a given mission time (see section 5.1.1 and figure 5.1). In the original version the list of intervals was implemented as a CArray of pointers to interval objects. The search algorithm linearly searched the array until all intervals into which a given mission falls were found. When a long simulation period or short intervals are used, this algorithm becomes very inefficient, since many interval objects need to be accessed during the search. In this process many different memory pages are touched, which leads to a high TLB miss rate. This phenomenon is known as TLB thrashing.

To improve locality, the data structure containing the interval pointers was refactored to a std::vector of objects, which guarantees that the elements are placed consecutively in memory and hence fewer pages are traversed during a search. As anticipated, both the number of TLB lookups that miss in all caches and the execution time were lower after the refactoring. Unfortunately, the number of lookups that hit only in memory was still higher in the parallel version compared to the sequential version (see figure 5.7).


Of course, linearly searching the complete data structure is not the best possible algorithm in the first place. Fortunately, the Boost Interval Container Library (ICL) provides exactly the data structures and algorithms required to perform the search in logarithmic time. In addition, the working-set size is reduced and fewer pages are touched. Therefore, the data structure was refactored to a boost::icl::interval_map in the second step. This refactoring turned out to be very effective, since the number of address translations that have to be looked up in memory became essentially zero (see figure 5.7). Even more, the refactoring reduced the memory bandwidth consumed significantly, as figure 5.8 shows, and the application became mostly cache bound for this test case.

(a) Eight threads linear search

(b) Eight threads using ICL interval_map

Figure 5.8: Impact of using a search tree on DRAM bandwidth usage. Grey area is total bandwidth, green line is read bandwidth and red line is write bandwidth. 51 GB/s is the maximum DRAM bandwidth on the test system as benchmarked by VTune. Note that the time axes in the plots are different.
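A minimal sketch of the idea with Boost.ICL (interval bounds and index values are invented for illustration; SIMLOX’s actual data model differs):

    #include <boost/icl/interval_map.hpp>
    #include <iostream>

    int main() {
        namespace icl = boost::icl;
        using Ival = icl::interval<double>;

        // Hypothetical: map each result collection interval [t1, t2) to its index.
        icl::interval_map<double, int> intervals;
        intervals += std::make_pair(Ival::right_open(0.0, 8.0), 1);
        intervals += std::make_pair(Ival::right_open(8.0, 24.0), 2);
        intervals += std::make_pair(Ival::right_open(24.0, 48.0), 3);

        // Find all intervals overlapping a process spanning [a, b) in
        // logarithmic time: locate the first hit, then walk forward.
        const double a = 6.0, b = 30.0;
        for (auto it = intervals.find(a);
             it != intervals.end() && icl::lower(it->first) < b; ++it)
            std::cout << "interval " << it->second
                      << " overlaps [" << a << ", " << b << ")\n";
    }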

Unfortunately, changing the data structure was not straightforward due to circular dependencies in the SIMLOX object graph. In the original version the constructor of the container elements itself instantiated new objects on the heap, to which the constructor passed its this pointer. The instantiated objects stored the pointer as a member, since some algorithms were dependent on this back reference. This is conceptually shown in figure 5.9a. This was not a problem as long as the outer container only contained pointers, but it makes copying the list elements impossible, which is required for std::vector’s push_back to work if the container stores objects rather than pointers. Consequently, the circular dependency had to be broken up and all the algorithms that depended on the back reference had to be rewritten (see figure 5.9b). This pattern is frequent in SIMLOX and makes refactoring of data structures difficult at times.

(a) Circular dependency before refactoring: X holds as : vector<A*>, A holds bs : vector<B*>, and B holds a back pointer a : A*.

(b) Circular dependency broken up after refactoring: X holds as : vector<A>, A holds bs : vector<B*>, and B no longer stores a pointer to A.

Figure 5.9: Frequent pattern in the SIMLOX code base. Copying instances of A is impossible in (a), since it would invalidate the a pointer in B. Therefore, the container as cannot easily be refactored to a container of objects. In (b) the circular dependency is broken up and copying instances of A becomes possible.
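A hypothetical sketch of the refactoring in figure 5.9 (all names invented): the back pointer is removed and algorithms receive the owner explicitly, so A becomes copyable and can live in a std::vector<A>.

    #include <vector>

    // Before (non-copyable A, shown as a comment only):
    //   struct B { A* owner; };                        // back pointer into A
    //   struct A { std::vector<B*> bs; /* B{this} */ };
    //
    // After: no back references; A owns its Bs by value.
    struct B {
        int payload;
    };

    struct A {
        std::vector<B> bs;  // owned by value; copying A copies its Bs
    };

    // Algorithms that previously followed b.owner now take the owner as argument.
    int process(const A& owner, const B& b) {
        return b.payload + static_cast<int>(owner.bs.size());
    }

    int main() {
        std::vector<A> as(4);            // copying/relocating A is now safe
        as[0].bs.push_back(B{7});
        return process(as[0], as[0].bs[0]);
    }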

5.6.6 Thread Affinity

By default the Windows scheduling algorithm is free to schedule any thread on any core, or even to schedule multiple threads on a single core through Simultaneous Multithreading (SMT). An example of how Windows schedules worker threads is shown in figure 5.10a. As the plot shows, thread migration between cores is frequent. The performance penalty of thread migration should not be underestimated, especially since the caches and Translation Lookaside Buffers (TLBs) are cold after thread migration and many cache misses occur until the required entries have been brought back into the cache.

For CPU-intensive applications such as SIMLOX it is therefore often beneficial to reduce the amount of thread migration through the use of thread affinity settings. They allow worker threads to be “pinned” to a single core on which they can be executed without interference from other threads. An execution with thread affinity settings enabled is shown in figure 5.10b.

C++11 threading facilities lack mechanisms to enable thread affinity. Therefore, the native Windows SetThreadAffinityMask call had to be used.
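A minimal sketch of pinning a C++11 thread on Windows through its native handle (the worker body is a placeholder):

    #include <thread>
    #include <windows.h>

    int main() {
        std::thread worker([] { /* replication work */ });

        // Pin the worker to core 0: bit i of the mask allows execution on core i.
        SetThreadAffinityMask(static_cast<HANDLE>(worker.native_handle()),
                              DWORD_PTR(1) << 0);
        worker.join();
    }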

(a) Without thread affinity settings, Windows schedules threads freely amongst all cores.

(b) With thread affinity settings, threads are pinned to a single core.

Figure 5.10: Parallel execution of eight SIMLOX worker threads with and without thread affinity settings enabled. Two consecutive “CPUs” such as cpu_0 and cpu_1 are in fact a single SMT-enabled core.

5.6.7 Shorten Sequential Startup Phase

Shift profiles are used to model the availability of resources in SIMLOX, for example a workshop that is closed on weekends. The algorithm used in SIMLOX to initialise shift profiles during the sequential startup phase (see section 5.1.2) was implemented inefficiently. This limited the scalability of the wind farm test case due to Amdahl’s law, since this case makes heavy use of shift profiles [Joh13].

The algorithm linearly searched a list to check the existence of an element for which equality of two members is fulfilled. By changing the data structure to a std::unordered_map (hashed) and using the two members which are checked as the key, the overhead from the sequential phase diminished almost completely.
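A minimal sketch of the idea (all names hypothetical): instead of linearly scanning a list for an element whose two members match, a hash map keyed on those two members makes the existence check O(1) on average.

    #include <string>
    #include <unordered_map>

    struct ShiftKey {
        int resourceId;
        int profileId;
        bool operator==(const ShiftKey& o) const {
            return resourceId == o.resourceId && profileId == o.profileId;
        }
    };

    struct ShiftKeyHash {
        std::size_t operator()(const ShiftKey& k) const {
            // Combine the two members into one hash value.
            return std::hash<int>()(k.resourceId) * 31u + std::hash<int>()(k.profileId);
        }
    };

    int main() {
        std::unordered_map<ShiftKey, std::string, ShiftKeyHash> shifts;
        shifts[{42, 7}] = "closed on weekends";

        // Average O(1) existence check replacing the former linear search.
        return shifts.count({42, 7}) ? 0 : 1;
    }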

5.6.8 Refactor Preventive Maintenance Algorithm

“Preventive Maintenance Activation” (short: PMActivation) is a concept of the SIMLOX model which is “used to control whether preventive maintenance shall be performed or not at different locations and/or during different periods in time” [Sys]. It is controlled by two parameters: Enable PM (ENPM) and PM Criticality (CRIT), which are set for all PMActivation entries in the input. During execution those two parameters are frequently queried. Therefore, using a well-suited data structure is important.

The algorithms used in SIMLOX were linearly searching all PMActivation entries. This performance issue was previously detected by [HM11] in a wind farm simulation project. Again, the total time spent for this type of search could be significantly reduced by refactoring to a std::unordered_map. It seemed to be the number of searches that was problematic rather than the size of the search space; hence, contrary to what was described in section 5.6.5, locality was not an issue. In that sense, this optimisation eliminated a CPU time hot spot.

5.7 Further Optimisations Evaluated

In this section, optimisations carried out during the project that did not result in a significant speedup are discussed.

5.7.1 Auto Vectorisation

When evaluating the most suitable compiler options for SIMLOX in section 5.6.1, the impact of automatic vectorisation was evaluated. However, no improvement in runtime could be observed. This is due to the fact that in SIMLOX the majority of the data-intensive loops manipulate either C++ Standard Template Library (STL) or Microsoft Foundation Class Library (MFC) collections, and function calls usually prevent both auto and explicit vectorisation (see [JR13]).

5.7.2 Disable Log Content Generation

SIMLOX has a log functionality which allows all events during a simulation to be written into a text file. The log functionality is used by SIMLOX end users to verify the correctness of their simulation model but is disabled during long simulation runs. Disabling the log file generation only prevents the very last step, i.e. writing log data to the file; in many cases the log content is generated in memory nonetheless. To measure the impact of this inefficiency, the performance of a test build with the log functionality completely disabled was evaluated. Surprisingly, no measurable performance improvement was observed. Therefore, this seems to be a minor imperfection but not a hot spot consuming a significant amount of time.


Chapter 6

Results

To demonstrate the impact of the optimisations presented in section 5.6, ten builds of SIMLOX were created according to table 6.1. The improvements in scalability are discussed in section 6.1 and the memory footprint is analysed in section 6.2. The latter is important since some SIMLOX cases require a lot of memory to execute; therefore it has to be ensured that none of the performance optimisations significantly increases the memory usage.

For reasons explained in section 3.1.8, Turbo Boost is disabled for all test runs, except in section 6.3, in which the impact of Dynamic Voltage and Frequency Scaling (DVFS) is studied. The effect of Simultaneous Multithreading (SMT) is analysed in section 6.4.

Table 6.1: Overview of all SIMLOX builds tested. The × symbol denotes which improvements are included in which build (columns are the legend entries used in the plots).

Improvement                   Section   baseline  vcpp  icc  tbb  string  pgo  icl  affinity  startup  final
Baseline                                   ×
Visual C++                    5.6.1                 ×
Intel C++                     5.6.1                       ×    ×     ×     ×    ×       ×        ×       ×
TBB Memory Allocator          5.6.2                            ×     ×     ×    ×       ×        ×       ×
Eliminated Write Sharing      5.6.3                                  ×     ×    ×       ×        ×       ×
Using PGO                     5.6.4                                        ×    ×       ×        ×       ×
Eliminated TLB Thrashing      5.6.5                                             ×       ×        ×       ×
Thread Affinity               5.6.6                                                     ×        ×       ×
Shorter Initialisation Time   5.6.7                                                              ×       ×
Refactored Inefficient PM     5.6.8                                                                      ×


6.1 Scalability

Every build (see table 6.1) was tested with every test case (see section 5.2.1) with 1, 2, 4 and 8 active worker threads on the test hardware (see section 5.2.2). All test cases were configured to use eight replications (see section 2.3.1). To minimise the impact of background system activity etc., every run was repeated ten times.

The wall times (time elapsed between start and end of the program), corresponding standard errors, speedup and efficiency metrics are shown for the baseline and final versions in tabular form; the definitions of speedup and efficiency are recalled after the list below. Additionally, two types of plots are used to graphically illustrate the results:

Log-log wall time plot with the number of worker threads on the x-axis and the averaged wall time in seconds on the y-axis. This format was chosen since even slight degradations in scalability become clearly visible as an upward rise. An optimisation is valid if it results in a lower absolute value (i.e. faster execution time) or a straighter falling line (i.e. improved scalability).

Lin-lin speedup plot with the number of worker threads on the x-axis and the averaged speedup on the y-axis. This type of plot is included to enable an easy interpretation of the speedup. A straighter rising line indicates improved scalability. However, this format lacks information on the absolute execution time.
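The speedup and efficiency values in the following tables follow the usual definitions; they reproduce the tabulated numbers, e.g. 211.1/28.2 ≈ 7.49 and 7.49/8 ≈ 0.94 for the Subway Depot baseline:

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}
\]

where $T_1$ is the average sequential wall time and $T_p$ the average wall time with $p$ worker threads.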

Test Case: Subway Depot

This test case already had satisfying scalability in the baseline version (baseline). As table 6.2 shows, the overall runtime could be improved by a factor of approximately 1.8 while maintaining a parallel efficiency of ≥ 0.94 (see figure 6.2). According to figure 6.1 this speedup mostly originates from the optimised compiler flags and the improved data structure for the preventive maintenance algorithm (final). The other improvements had a minor impact. Profile-Guided Optimisation (PGO) (pgo) even worsened runtime slightly.

Table 6.2: Improvement in scalability and runtime for test case Subway Depot.

              Sequential Execution         8-way Parallel Execution
              Avg. Wall    Standard        Avg. Wall    Standard
              Time (s)     Error           Time (s)     Error       Speedup   Efficiency
baseline      211.1        0.19            28.2         0.07        7.49x     0.94
final         118.9        0.35            15.4         0.17        7.70x     0.96
Improvement   1.78x                        1.83x                    0.21      0.02



Figure 6.1: Log-log wall time plot for test case Subway Depot.


Figure 6.2: Lin-lin speedup plot for test case Subway Depot.

Test Case: Light Rail

As figure 6.3 shows, optimising compiler flags (vcpp, icc) mainly improved sequential performance, which leads to a drop in speedup from 4.83x to 3.98x and 3.82x respectively, as figure 6.4 indicates. Using a scalable memory allocator (tbb) and changing the string class (string) led to increased scalability, such that the speedup becomes 6.33x. Profile-Guided Optimisation (PGO) (pgo) reduces sequential execution time a bit more than parallel; hence the speedup drops again to 6.23x in figure 6.4. Refactoring the data structure containing the interval data (icl) and using affinity settings (affinity) resulted in an improved execution time and a speedup of 6.88x. The two final enhancements (startup, final) worsened runtime a little, since those optimisations target larger test cases than this one. Overall the speedup was improved from 4.83x to 6.75x, while at the same time approximately cutting the parallel execution time in half (see table 6.3).


Table 6.3: Improvement in scalability and runtime for test case Light Rail.

              Sequential Execution         8-way Parallel Execution
              Avg. Wall    Standard        Avg. Wall    Standard
              Time (s)     Error           Time (s)     Error       Speedup   Efficiency
baseline      177.1        1.43            36.6         0.10        4.83x     0.60
final         113.4        0.12            16.8         0.04        6.75x     0.84
Improvement   1.56x                        2.18x                    1.92      0.24


Figure 6.3: Log-log wall time plot for test case Light Rail.


Figure 6.4: Lin-lin speedup plot for test case Light Rail.


Test Case: Subway Line A

Figure 6.5 shows a slight improvement in sequential runtime from optimising compiler flags (vcpp, icc). Therefore, the speedup drops even below 1.0x in figure 6.6. Using the scalable memory allocator (tbb) does not lead to any changes, since scaling was prevented by the Copy-on-Write (COW)-based string class. After switching to a non-COW string (string), the speedup jumped to 7.36x. Profile-Guided Optimisation (PGO) (pgo) increases the speedup to 7.42x. Even the sequential runtime is reduced by more than four minutes by this change. Changing the interval data structure (icl) reduces sequential execution time more than it reduces parallel execution time and leads to a drop in scalability to 7.35x. The final three enhancements (affinity, startup, final) in combination result in a speedup of 7.77x. In summary this is a 14x faster parallel execution (baseline vs. final) (see table 6.4).

Table 6.4: Improvement in scalability and runtime for test case Subway Line A.

              Sequential Execution         8-way Parallel Execution
              Avg. Wall    Standard        Avg. Wall    Standard
              Time (s)     Error           Time (s)     Error       Speedup   Efficiency
baseline      1071.2       1.18            955.5        1.05        1.12x     0.14
final         517.0        1.36            66.5         0.08        7.77x     0.97
Improvement   2.07x                        14.37x                   6.65      0.83


Figure 6.5: Log-log wall time plot for test case Subway Line A.



Figure 6.6: Lin-lin speedup plot for test case Subway Line A.

Test Case: Subway Line B

Similar to what has been observed in the test case Subway Line A, the scaling was mainly prevented by the Copy-on-Write (COW)-based string class. Hence, the speedup improved from 1.52x to 6.66x (see figure 6.8, string) when exchanging the string implementation. This is the only test case in which the Microsoft Visual C++ compiler (vcpp) significantly outperforms the Intel C++ compiler (icc) (see figure 6.7). However, in those two builds the COW string was still present and therefore it is hard to draw an accurate conclusion. Profile-Guided Optimisation (PGO) (pgo) reduces runtime slightly, but comes with a lower speedup of 6.57x, since the sequential execution benefits a bit more from the optimisation. The build that includes a more suitable data structure for intervals (icl) improves runtime and leads to a speedup of 6.71x. Similarly, both runtime and speedup are improved in the version with shorter startup time (startup), resulting in a speedup of 7.22x. The final version (final) does not introduce a change. Overall the parallel execution time improved by more than a factor of 5 (see table 6.5).

Table 6.5: Improvement in scalability and runtime for test case Subway Line B.

              Sequential Execution         8-way Parallel Execution
              Avg. Wall    Standard        Avg. Wall    Standard
              Time (s)     Error           Time (s)     Error       Speedup   Efficiency
baseline      779.1        1.91            314.7        2.17        2.48x     0.31
final         437.1        2.17            60.5         0.04        7.22x     0.90
Improvement   1.78x                        5.2x                     5.05      0.59



Figure 6.7: Log-log wall time plot for test case Subway Line B.


Figure 6.8: Lin-lin speedup plot for test case Subway Line B.

Test Case: Rolling Stock

This test case had a speedup of 6.50x in the baseline version (baseline) (see table 6.6). While both the Microsoft Visual C++ compiler (vcpp) and Intel C++ compiler (icc) optimisation options improve runtime, the latter seems to perform a bit better (see figure 6.9). The Intel Threading Building Blocks (TBB) scalable memory allocator (tbb) hardly affects the runtime. An improvement in speedup to 7.65x originates from changing the string class used (string) (see figure 6.10). The next four versions (pgo, icl, affinity, startup) all result in minimal scalability improvements, totalling a speedup of 7.79x. Changing the data structure used in the preventive maintenance algorithm improved the overall execution time (final) significantly; however, the speedup drops back to 7.67x since the sequential execution speeds up more than the parallel.


Table 6.6: Improvement in scalability and runtime for test case Rolling Stock.

              Sequential Execution         8-way Parallel Execution
              Avg. Wall    Standard        Avg. Wall    Standard
              Time (s)     Error           Time (s)     Error       Speedup   Efficiency
baseline      5046         6.423           777          1.607       6.50x     0.81
final         2893         9.516           377          0.508       7.67x     0.96
Improvement   1.74x                        2.06x                    1.17      0.15


Figure 6.9: Log-log wall time plot for test case Rolling Stock.


Figure 6.10: Lin-lin speedup plot for test case Rolling Stock.


Test Case: Wind Farm

In the baseline version (baseline) the speedup is 4.26x (see table 6.7). Optimising compiler flags mostly benefits sequential execution (vcpp, icc), and hence a lower speedup is shown in figure 6.12. The scalable Intel Threading Building Blocks (TBB) memory allocator improves scalability slightly. A bigger improvement originates from factoring out the Copy-on-Write (COW)-based CString class (string). In the following versions (pgo, icl, affinity) no improvement for parallel execution can be observed, since scalability was limited by a long sequential startup phase (Amdahl’s law). This was fixed by changing a data structure (startup), which resulted in a speedup of 7.35x (final).

Table 6.7: Improvement in scalability and runtime for test case Wind Farm.

              Sequential Execution         8-way Parallel Execution
              Avg. Wall    Standard        Avg. Wall    Standard
              Time (s)     Error           Time (s)     Error       Speedup   Efficiency
baseline      1803.5       1.15            423.4        0.38        4.26x     0.53
final         1202.4       2.75            163.6        0.24        7.35x     0.92
Improvement   1.50x                        2.59x                    3.09      0.39


Figure 6.11: Log-log scalability plot for test case Wind Farm.



Figure 6.12: Lin-lin speedup plot for test case Wind Farm.

6.2 Memory Footprint

The memory usage was obtained by querying the “private bytes” [Mic] performance counter through the Windows Management Instrumentation (WMI) command-line interface. The main value of interest is the peak memory usage (see table 6.8), which should not increase significantly, since otherwise this could prevent the execution of some large SIMLOX cases.

Table 6.8: Peak memory consumption for every test case in 8-way parallel execution for a subset of all SIMLOX builds.

                          Peak Memory Consumption in MB
Test Case       Figure   icc      tbb      string   final    Total Increase
Subway Depot    6.13     159.2    166.4    166.4    170.6      7%
Light Rail      6.14     2818.7   2777.4   2773.3   2861.6     2%
Subway Line A   6.15     2380.8   2306.4   2321.3   2363.3    -1%
Subway Line B   6.16     4264.9   4285.5   4327.5   4436.9     4%
Rolling Stock   6.17     4074.0   4238.8   4255.7   4269.4     5%
Wind Farm       6.18     8234.3   9201.2   9218.1   9214.1    12%

A phenomenon that can be observed in the memory usage plots (see figures 6.13 to 6.18) is that when the Intel Threading Building Blocks (TBB) scalable allocator is used, more memory is occupied after the peak has been reached. This is most likely due to the fact that the TBB allocator pools memory and uses a more conservative strategy than the default Windows allocator for returning memory to the operating system. Since the memory usage data was gathered through the WMI interface, the plots show how much memory has been returned by the allocator to the operating system rather than how much memory has been returned from the application to the allocator.

6.3 Impact of Turbo Boost

Since SIMLOX is usually deployed to desktop computers on which Turbo Boost (see section 3.1.8) is enabled, end users are likely to experience slightly lower scalability than what has been measured in section 6.1, where Turbo Boost was turned off. For the test in this section the last run of the test case Light Rail was repeated with Turbo Boost enabled. As explained in section 5.2.2, the processor used in the test system can run a single-threaded workload at up to 3.2 GHz, whereas when all eight cores are active, the maximum frequency is limited to 2.6 GHz. This compares to the base frequency of 2.4 GHz, which is used whenever Turbo Boost is disabled, independent of the workload. As a consequence, the sequential execution experiences a speedup of factor 1.9 thanks to Turbo Boost, while the 8-way parallel execution only sees a speedup of factor 1.1. This leads to a lower parallel efficiency, as table 6.9 and figure 6.19 show.

Table 6.9: Impact of Turbo Boost on sequential and parallel execution of SIMLOX.

                       Sequential Execution         8-way Parallel Execution
                       Avg. Wall    Standard        Avg. Wall    Standard
                       Time (s)     Error           Time (s)     Error       Speedup   Efficiency
Turbo Boost Disabled   113.4        0.12            16.8         0.04        6.75x     0.84
Turbo Boost Enabled    85.9         0.08            15.1         0.15        5.67x     0.71
Difference             1.93x                        1.11x                    -1.08     -0.13



Figure 6.13: Memory footprint of test case Subway Depot


Figure 6.14: Memory footprint of test case Light Rail

Figure 6.15: Memory footprint of test case Subway Line A. [Plot: memory usage in bytes over normalised time for the icc, tbb, string and final builds.]


Figure 6.16: Memory footprint of test case Subway Line B. [Plot: memory usage in bytes over normalised time for the icc, tbb, string and final builds.]

Figure 6.17: Memory footprint of test case Rolling Stock. [Plot: memory usage in bytes over normalised time for the icc, tbb, string and final builds.]

Figure 6.18: Memory footprint of test case Wind Farm. [Plot: memory usage in bytes over normalised time for the icc, tbb, string and final builds.]


Figure 6.19: Log-log scalability plot illustrating the impact of Turbo Boost on the test case Light Rail. Turbo Boost improves the runtime, but decreases the scalability. [Plot: average wall time in seconds versus number of threads (1, 2, 4, 8), with Turbo Boost disabled and enabled.]

6.4 Impact of Simultaneous Multithreading

To evaluate whether SIMLOX is suitable for Simultaneous Multithreading (SMT) (see section 3.1.2), the final version of the software (final), which includes all optimisations (see table 6.1), was additionally tested with 1, 2, 4, 8 and 16 active worker threads. For this purpose, the test cases were modified to use 16 replications (see section 2.3.1). Recall that the test hardware (see section 5.2.2) is an eight-core system with support for 2-way SMT. Every run was repeated ten times. The results are shown in figure 6.20 and table 6.10. The performance impact varies between a degradation of 9.6% and an improvement of 17.0%.
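As an illustration of this measurement protocol, the sketch below averages wall times over ten repeats for each thread count; run_replications() is a hypothetical stand-in for one full SIMLOX run, since the real engine entry point is not part of this report.

    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-in for one full simulation run with the given number
    // of worker threads; SIMLOX's real entry point is not shown in this report.
    static void run_replications(int workerThreads) { (void)workerThreads; }

    int main() {
        const int threadCounts[] = {1, 2, 4, 8, 16};  // 16 threads exercise 2-way SMT
        const int repeats = 10;                       // every run repeated ten times
        for (int threads : threadCounts) {
            double totalSeconds = 0.0;
            for (int r = 0; r < repeats; ++r) {
                const auto start = std::chrono::steady_clock::now();
                run_replications(threads);
                const std::chrono::duration<double> elapsed =
                    std::chrono::steady_clock::now() - start;
                totalSeconds += elapsed.count();
            }
            std::printf("%2d threads: average wall time %.1f s\n",
                        threads, totalSeconds / repeats);
        }
        return 0;
    }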


Table 6.10: Comparison of 8-way parallel execution and 16-way parallel execution using SMT for every test case.

                 8-way Parallel Execution             16-way Parallel Execution (SMT)
Test Case        Avg. Wall  Std.   Speedup  Effic.    Avg. Wall  Std.   Speedup  Effic.   Benefit
                 Time (s)   Error                     Time (s)   Error                    from SMT
Subway Depot      30.8      0.04   7.83x    0.98       29.5      0.04   8.20x    0.51      4.7%
Light Rail        30.8      0.48   7.07x    0.88       35.5      0.04   6.40x    0.40     -9.6%
Subway Line A    135.3      0.12   7.66x    0.96      125.0      0.06   8.29x    0.52      8.2%
Subway Line B    125.8      0.23   7.52x    0.94      127.3      0.07   7.42x    0.46     -1.2%
Rolling Stock    742.5      1.11   7.75x    0.97      634.9      0.67   9.06x    0.57     17.0%
Wind Farm        322.1      0.83   7.39x    0.92      352.0      0.29   6.76x    0.42     -8.5%


Figure 6.20: Log-log wall time plots illustrating the impact of SMT per test case. [Plots: average wall time in seconds versus number of threads (1, 2, 4, 8, 16 (SMT)). (a) Subway Depot: improvement 4.7%; (b) Light Rail: improvement -9.6%; (c) Subway Line A: improvement 8.2%; (d) Subway Line B: improvement -1.2%; (e) Rolling Stock: improvement 17.0%; (f) Wind Farm: improvement -8.5%.]


Chapter 7

Discussion and Conclusions

Section 7.1 gives a short overview of the problem dealt with, the optimisation process applied and the main problems identified in this project. Additionally, the results from chapter 6 are presented in condensed form. Section 7.2 summarises the optimisations implemented and the findings and insights gained. Lastly, section 7.3 discusses some limitations of this project and suggests further work.

7.1 Summary of Method and Results

SIMLOX, a discrete-event simulation software, suffered from a number of scalability bottlenecks even though the Multiple Replications in Parallel (MRIP) parallelisation strategy (see section 2.3.1), on which the software is based, in theory promises linear scalability. Through profiling and microarchitecture analysis (see section 5.3), the causes were successfully identified and methods to mitigate the problems were presented. In order to interpret the performance data correctly, it was important to use a sufficiently detailed mental model of a multi-core system during the process (see section 3.1).

The root cause of the performance issues in SIMLOX was a combination of:

• using libraries and operating system facilities not optimised for multi-core (see sections 5.6.2 and 5.6.3)

• non-optimal compiler settings (see sections 5.6.1 and 5.6.4)

• insufficient locality, which decreased scalability due to Translation Lookaside Buffer (TLB) thrashing (see section 5.6.5)

• code hot spots which consumed a significant amount of time (see sections 5.6.7 and 5.6.8).

The effectiveness of the suggested optimisations was proven by benchmarking an experimental implementation. Initially, the parallel efficiency when executing SIMLOX using eight worker threads on an eight-core processor varied between 0.14 and 0.94 depending on the test case. The combination of all the optimisations proposed (see section 5.6) has resulted in both better scalability and shorter execution time. Parallel efficiency was increased to between 0.84 and 0.97 depending on the test case (see figure 7.1). Therefore, it can be concluded that achieving nearly linear speedup on a multi-core processor is possible not only in theory, but also in practice for a fully developed discrete-event simulator and real-world simulation cases.

Figure 7.1: Parallel efficiency for all six test cases used in this project before and after applying all the optimisations presented in this project. [Chart: 8-way parallel efficiency (0–1) per test case for the baseline and final builds.]

7.2 Discussion of Key Findings

As a consequence of SIMLOX’s high degree of flexibility, different test cases expose different scalability issues. Similarly, the optimisations presented have varying impacts depending on the test case, as section 6.1 shows. The following general observations can be made:

1. The biggest improvement in parallel efficiency could generally be observed from switching from the Copy-on-Write (COW)-based CString class to the non-COW-based std::string. The atomic operations used in the CString class caused heavy cache coherence traffic in parallel execution, which degraded scalability (see section 5.6.3); a toy illustration of this effect follows this list.

2. A less obvious scalability problem was due to an algorithm that had a low degree of locality (see section 3.2.2). The algorithm was accessing many memory pages in rapid succession, which caused Translation Lookaside Buffer (TLB) thrashing (see section 5.6.5). Data Translation Lookaside Buffers (DTLBs) are private (fully replicated) to each core, but on a miss, the page table is looked up in the Level 3 (L3) cache, for which the threads compete (see section 3.1.5). Changing the underlying data structure resolved the problem.

3. Heap memory is used so heavily by SIMLOX that between 10 and 15% of the execution time is spent inside memory allocation and deallocation functions (see section 5.6.2). Therefore, it is crucial that a memory allocator optimised for multi-core systems is used. Unfortunately, the default implementations shipped with operating systems are often not fully multi-core ready. Evaluation of the Intel Threading Building Blocks (TBB) scalable memory allocator for SIMLOX yielded promising results; see the allocator sketch after this list.

58

Page 68: Performance Optimisation of Discrete-Event Simulation ... › smash › get › diva2:954999 › FULLTEXT01.pdf · Performance Optimisation of Discrete-Event Simulation Software on

4. A number of compiler optimisation techniques were applied (see sections 5.6.1 and 5.6.4). This type of optimisation frequently benefits sequential execution more than parallel execution. Hence, when considering only the speedup, a degradation seems to take place. This is, however, not a valid conclusion, since the overall wall time still decreases.

5. Relatively simple algorithmic improvements helped reduce the execution time significantly. One hot spot was detected in the sequential startup phase of the simulation process, which caused bad scalability due to Amdahl’s law (see section 5.6.7). A second hot spot was eliminated inside the parallel region, such that the overall wall time could be reduced (see section 5.6.8).
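To make finding 1 above concrete, the toy class below mimics the shared atomic reference count at the core of a COW string; it is a simplified model, not MFC’s actual CString implementation. Copying such an object from eight threads turns every copy and destruction into an atomic read-modify-write on the same cache line, which then bounces between cores; std::string copies, by contrast, use independent storage.

    #include <atomic>
    #include <thread>
    #include <vector>

    // Toy model of a COW string (not MFC's actual CString): all copies share
    // one control block, so every copy/destroy is an atomic read-modify-write
    // on the same cache line.
    struct CowString {
        struct Shared { std::atomic<long> refs{1}; /* character data omitted */ };
        Shared* s;
        CowString() : s(new Shared) {}
        CowString(const CowString& o) : s(o.s) {
            s->refs.fetch_add(1, std::memory_order_relaxed);  // coherence traffic
        }
        ~CowString() {
            if (s->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete s;
        }
    };

    int main() {
        CowString shared;                 // one string copied by all workers
        std::vector<std::thread> workers;
        for (int t = 0; t < 8; ++t)
            workers.emplace_back([&shared] {
                for (int i = 0; i < 1000000; ++i)
                    CowString copy(shared);   // refcount cache line ping-pongs
            });
        for (auto& w : workers) w.join();
        return 0;
    }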
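For finding 3, the sketch below shows two common ways of routing heap traffic through the TBB scalable allocator; the Event type is a hypothetical example, and TBB additionally ships a proxy library that can replace the process allocator without source changes.

    #include <tbb/scalable_allocator.h>
    #include <vector>

    // Hypothetical record type; SIMLOX's real event classes are not shown here.
    struct Event { double time; int resourceId; };

    // STL container whose allocations go to TBB's per-thread memory pools
    // instead of the default Windows heap.
    using EventList = std::vector<Event, tbb::scalable_allocator<Event>>;

    int main() {
        EventList events;
        events.reserve(1024);              // served by the scalable allocator
        events.push_back(Event{0.0, 42});

        // The C interface can substitute malloc/free directly in hot spots.
        void* raw = scalable_malloc(256);
        scalable_free(raw);
        return 0;
    }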

Algorithmic improvements are often much more effective than other optimisations such as architectural tuning, since they have the potential of significantly reducing the amount of work in the first place. [Kuk15] warns that this is often ignored in practice and reminds that: “no matter how much time is spent tuning a bubblesort implementation, a quicksort implementation will be faster”.

The peak memory usage increased by up to 12% from the performance optimisations introduced (see section 6.2). This additional resource requirement is acceptable given the increased performance. Most of the increase in memory consumption is due to the usage of the TBB scalable memory allocator, which is designed for scalability rather than space efficiency (see section 5.6.2).

The performance benefits from Simultaneous Multithreading (SMT) (see section 3.1.2) are mixed, as section 6.4 showed. In some cases an impressive speedup of 17% could be observed, whereas in other cases the performance decreases. As discussed in section 3.1.2, it is usually not possible to optimise a workload for more efficient execution on SMT-enabled processors. Hence, determining the optimal number of threads for SIMLOX is difficult.

The analysis of the impact of Turbo Boost revealed nothing surprising (see section 6.3), but it illustrates why in practice end users are likely to see slightly lower scalability than what the results in this report indicate (see section 3.1.8).

7.3 Limitations and Further Work

The optimisations presented in this project were only implemented experimentally and have yet to undergo sufficient testing for inclusion in the next stable SIMLOX release.

To ensure that all SIMLOX end users will experience satisfactory runtimes, performance analysis should be continued with an even larger number of test cases. It is likely that a number of additional hot spots exist that can be eliminated by changing the algorithms and data structures. As discussed above, this would likely result in a noticeable performance gain.

This project focused exclusively on optimising SIMLOX for single-processor systems, due to the test hardware available. Therefore, the next step is to repeat the tests from this project on a multi-processor system to ensure that SIMLOX is compatible with non-uniform memory access (NUMA) systems. Additionally, it would be interesting to see if SIMLOX is suited to running on a second-generation standalone Many Integrated Core (MIC) processor (see section 3.1.7). However, the low degree of vectorisable code present in SIMLOX could limit its suitability.


References

[Ali+15] Dan Alistarh et al. “The SprayList: A Scalable Relaxed Priority Queue”. In: SIGPLAN Not. 50.8 (Jan. 2015), pp. 11–20. doi: 10.1145/2858788.2688523.

[AM10] Tayfur Altiok and Benjamin Melamed. Simulation Modeling and Analysis with ARENA. Elsevier Science, 2010.

[BCS12] Stephen Blair-Chappell and Andrew Stokes. Parallel Programming with Intel Parallel Studio XE. John Wiley & Sons, 2012.

[Blu+95] Robert D. Blumofe et al. “Cilk: An Efficient Multithreaded Runtime System”. In: SIGPLAN Not. 30.8 (Aug. 1995), pp. 207–216. doi: 10.1145/209937.209958.

[Bor06] Håkan Borgström. “The Support Organization: A Strategic and Value Adding Force”. MA thesis. Lunds Tekniska högskola, 2006. url: http://lup.lub.lu.se/student-papers/record/1988599.

[CLT97] Wentong Cai, Emmanuelle Letertre, and Stephen J. Turner. “Dag Consistent Parallel Simulation: A Predictable and Robust Conservative Algorithm”. In: SIGSIM Simul. Dig. 27.1 (June 1997), pp. 178–181. doi: 10.1145/268823.268918.

[CW12] Ryan Child and Philip Wilsey. “Dynamically Adjusting Core Frequencies to Accelerate Time Warp Simulations in Many-Core Processors”. In: Principles of Advanced and Distributed Simulation (PADS), 2012 ACM/IEEE/SCS 26th Workshop on. 2012, pp. 35–43. doi: 10.1109/PADS.2012.15.

[EPM99] Greg Ewing, Krzysztof Pawlikowski, and Don McNickle. “Akaroa-2: Exploiting network computing by distributing stochastic simulation”. 1999. [Online; accessed 11-February-2016]. url: http://hdl.handle.net/10092/3109.

[EW15] Tomas Eriksson and Mats Werme. “The impact of system redundancies in the optimization of the support solution”. In: Reliability and Maintainability Symposium (RAMS), 2015 Annual. 2015, pp. 1–6. doi: 10.1109/RAMS.2015.7105144.

[Fog15] Agner Fog. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms. [Online; accessed 12-April-2016]. 2015. url: http://www.agner.org/optimize/optimizing_cpp.pdf.


[Fuj98] Richard M. Fujimoto. “Time Management in The High Level Architecture”. In: SIMULATION 71.6 (1998), pp. 388–400. doi: 10.1177/003754979807100604.

[Gho14] Sukumar Ghosh. Distributed Systems: An Algorithmic Approach, Second Edition. Chapman & Hall/CRC Computer and Information Science Series. Taylor & Francis, 2014.

[Haq+11] Mofassir Haque et al. “World-wide Distributed Multiple Replications in Parallel for Quantitative Sequential Simulation”. In: Proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing - Volume Part II. ICA3PP’11. Melbourne, Australia: Springer-Verlag, 2011, pp. 33–42. doi: 10.1007/978-3-642-24669-2_4.

[HM11] Catarina Hersenius and Ulrika Möller. “Operation and Maintenance of offshore wind farms: a study for Systecon AB”. MA thesis. Lunds Tekniska högskola, 2011.

[HP12] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann/Elsevier, 2012.

[Inta] Intel. Events for Intel(R) Microarchitecture Code Name Haswell X. [Online; accessed 03-April-2016]. url: https://software.intel.com/en-us/node/589936.

[Intb] Intel. Intel® Turbo Boost Technology in Intel® Core™ Microarchitecture (Nehalem) Based Processors. [Online; accessed 20-April-2016]. url: http://files.shareholder.com/downloads/INTC/0x0x348508/C9259E98-BE06-42C8-A433-E28F64CB8EF2/TurboBoostWhitePaper.pdf.

[Intc] Intel. Profile-Guided Optimizations Overview. [Online; accessed 2-April-2016]. url: https://software.intel.com/en-us/node/512789.

[Int15] Intel. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B, 3C and 3D. [Online; accessed 03-April-2016]. 2015. url: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.

[Int16] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual. [Online; accessed 12-April-2016]. 2016. url: http://www.intel.se/content/www/se/sv/architecture-and-technology/64-ia-32-architectures-optimization-manual.html.

[JAGP12] Deepak Jagtap, Nael Abu-Ghazaleh, and Dmitry Ponomarev. “Optimization of Parallel Discrete Event Simulator for Multi-core Systems”. In: Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. Shanghai, China, 2012, pp. 520–531. doi: 10.1109/IPDPS.2012.55.

[Joh13] Jeff Johansson. “Operational Validation of SIMLOX as a Simulation Tool for Wind Energy Operations and Maintenance (O&M)”. MA thesis. Kungliga Tekniska högskolan, 2013. url: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-127516.

[JR13] James Jeffers and James Reinders. Intel Xeon Phi Coprocessor High-Performance Programming. Elsevier Science, 2013.


[Kuk15] Jim Kukunas. Power and Performance: Software Analysis and Optimization. Elsevier Science, 2015.

[KV07] Alexey Kukanov and Michael J. Voss. “The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks”. In: Intel Technology Journal 11.4 (2007). [Online; accessed 11-February-2016]. url: http://www.intel.com/content/dam/www/public/us/en/documents/research/2007-vol11-iss-4-intel-technology-journal.pdf.

[Lam78] Leslie Lamport. “Time, Clocks, and the Ordering of Events in a Distributed System”. In: Commun. ACM 21.7 (July 1978), pp. 558–565. doi: 10.1145/359545.359563.

[Lee06] Edward A. Lee. “The problem with threads”. In: Computer 39.5 (2006), pp. 33–42. doi: 10.1109/MC.2006.180.

[LGC14] Justin M. LaPre, Elsa J. Gonsiorowski, and Christopher D. Carothers. “LORAIN: A Step Closer to the PDES ’Holy Grail’”. In: Proceedings of the 2nd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. SIGSIM PADS ’14. Denver, Colorado, USA: ACM, 2014, pp. 3–14. doi: 10.1145/2601381.2601397.

[Lv+10] Huiwei Lv et al. “P-GAS: Parallelizing a Cycle-Accurate Event-Driven Many-Core Processor Simulator Using Parallel Discrete Event Simulation”. In: Principles of Advanced and Distributed Simulation (PADS), 2010 IEEE Workshop on. 2010, pp. 1–8. doi: 10.1109/PADS.2010.5471655.

[Man15] Pekka Manninen. “Performance Engineering”. PDC Summer School Lecture. 2015.

[Mar14] Jackson Marusarz. “A Performance Tuning Methodology: From the System Down to the Hardware - Introduction”. Argonne Training Program on Extreme-Scale Computing Lecture. [Online; accessed 20-April-2016]. 2014. url: http://press3.mcs.anl.gov/computingschool/files/2014/08/Marusarz.1A-Performance-Tuning-Methodology_From-the-System-Down-to-the-Hardware_Introduction.pdf.

[Mic] Microsoft. Win32_PerfRawData_PerfProc_Process class. [Online; accessed 28-April-2016]. url: https://msdn.microsoft.com/en-us/library/aa394323(v=vs.85).aspx.

[Mic16] Microway Inc. Detailed Specifications of the Intel Xeon E5-2600v3 “Haswell-EP” Processors. [Online; accessed 20-April-2016]. 2016. url: https://www.microway.com/knowledge-center-articles/detailed-specifications-intel-xeon-e5-2600v3-haswell-ep-processors/.

[MKH91] Eric Mohr, David Kranz, and Robert H. Halstead. “Lazy task creation: a technique for increasing the granularity of parallel programs”. In: Parallel and Distributed Systems, IEEE Transactions on 2.3 (1991), pp. 264–280. doi: 10.1109/71.86103.

[MRR12] Michael McCool, James Reinders, and Arch Robison. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier Science, 2012.

[Nut11] James Nutaro. Building software for simulation: theory and algorithms, with applications in C++. Hoboken, N.J.: Wiley, 2011.


[Pas+11] Jonathan Passerat-Palmbach et al. “Warp-level parallelism: Enabling multiple replications in parallel on GPU”. In: ESM’2011, European Simulation and Modelling Conference. [Online; accessed 11-February-2016]. Guimaraes, Portugal, 2011, pp. 76–83. url: http://arxiv.org/abs/1501.01405.

[PYM94] Krzysztof Pawlikowski, Victor W. C. Yau, and Don McNickle. “Distributed Stochastic Discrete-event Simulation in Parallel Time Streams”. In: Proceedings of the 26th Conference on Winter Simulation. WSC ’94. Orlando, Florida, USA: Society for Computer Simulation International, 1994, pp. 723–730. doi: 10.1109/WSC.1994.717420.

[Rah13] Rezaur Rahman. Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers. Apress, 2013.

[Sch+15] Markus Schordan et al. “Reverse Code Generation for Parallel Discrete Event Simulation”. In: Reversible Computation: 7th International Conference, RC 2015, Grenoble, France, July 16-17, 2015, Proceedings. Ed. by Jean Krivine and Jean-Bernard Stefani. Cham: Springer International Publishing, 2015, pp. 95–110. doi: 10.1007/978-3-319-20860-2_6.

[Sod+16] Avinash Sodani et al. “Knights Landing: Second-Generation Intel Xeon Phi Product”. In: IEEE Micro 36.2 (2016), pp. 34–46. doi: 10.1109/MM.2016.25.

[Str] Beeman Strong. “A Look Inside Intel: The Core (Nehalem) Microarchitecture”. Technical Presentation, Intel.

[SYB04] Anthony Sulistio, Chee Shin Yeo, and Rajkumar Buyya. “A taxonomy of computer-based simulations and its mapping to parallel and distributed systems simulation tools”. In: Software: Practice and Experience 34.7 (2004), pp. 653–673. doi: 10.1002/spe.585.

[Sys] Systecon AB. PMActivation - SIMLOX version 8.0b Help.

[TY13] Wenjie Tang and Yiping Yao. “A GPU-based discrete event simulation kernel”. In: SIMULATION 89.11 (2013), pp. 1335–1354. doi: 10.1177/0037549713508839.

[VJT11] James R. Vash, Bongjin Jung, and Rishan Tan. Systems, methods, and apparatus for monitoring synchronization in a distributed cache. US Patent App. 12/644,506. 2011. url: http://www.google.com/patents/US20110153948.

[WA05] Barry Wilkinson and Michael Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, 2005.
