
Linköpings universitet, SE–581 83 Linköping
+46 13 28 10 00, www.liu.se

Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Datateknik

2018 | LIU-IDA/LITH-EX-A--18/044--SE

Parallelization of Aggregated FMUs using Static Scheduling

Mattias Hammar

Supervisor : Lennart Ochel
Examiner : Peter Fritzson

External supervisor : Labinot Polisi



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Mattias Hammar


Abstract

This thesis implements and evaluates static scheduling for aggregated FMUs. An aggregate FMU is several coupled FMUs placed in a single FMU. The implementation creates task graphs from the internal dependencies and connections between the coupled FMUs. These task graphs are then scheduled using two different list scheduling heuristics, MCP and HLFET. The resulting schedules are then executed in parallel by using OpenMP in the runtime. The implementation is evaluated by looking at the utilization of the schedule, the execution time of the scheduling and the speedup of the simulation. These measurements are taken on three different test models. With model exchange FMUs only a very small speedup is observed. With co-simulation models the speedup varies considerably depending on the model; the highest achieved speedup was 2.8 running on four cores.


Preface

This thesis was done at Modelon AB as a part of the EU project DEMOBASE.


Acknowledgments

Firstly, I would like to thank everyone at Modelon, especially my external supervisor Labinot, for all their help in making this thesis possible. I would also like to thank the staff at Linköping University for giving me a great education, and all of my classmates and the members of the academic computer club Lysator for making these five of the best years of my life.

Lastly, I also want to thank all my family and friends that have supported me in many different ways. I would not have managed this without all of your support.


Contents

Abstract
Preface
Acknowledgments
Contents
List of Figures
List of Tables
Listings

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations

2 Background
  2.1 Functional Mockup Interface

3 Theory
  3.1 Coupled Systems
  3.2 Functional Mockup Interface
  3.3 Aggregated FMU
  3.4 Parallel Computing Theory
  3.5 Multiprocessor Scheduling
  3.6 Related Work

4 Method
  4.1 Workflow
  4.2 Implementation
  4.3 Heuristics
  4.4 Test Models
  4.5 Evaluation
  4.6 Technical Limitations

5 Results
  5.1 Weights
  5.2 Evaluation

6 Discussion
  6.1 Results
  6.2 Method
  6.3 The work in a wider context

7 Conclusion
  7.1 Summary
  7.2 Future Work

Bibliography


List of Figures

3.1 The different model descriptions of two subsystems, coupled at the behavioral description.
3.2 Two coupled systems, the dashed lines are direct feed-through and the red lines show an algebraic loop. The outputs and inputs are denoted with y and u respectively.
3.3 A simple system.
3.4 An example of an FMU.
3.5 Two coupled systems, one with Co-simulation FMUs and one with Model Exchange FMUs.
3.6 The FMI Model Exchange FMU state machine
3.7 The FMI Co-Simulation state machine
3.8 An example of an aggregated FMU.
3.9 Graph showing the theoretical speedup according to Amdahl's law.
3.10 A simple Task Graph.
4.1 The workflow that was used during this thesis.
4.2 Task graph for a step sequence for two FMUs.
4.3 Two coupled FMUs.
4.4 An example task graph from the FMUs in figure 4.3
4.5 The task graph for the step sequence of the FMUs in figure 4.3
4.6 Overview of the Race Car model. Note that each wheel has direct feed-through and that the connections are vector valued. © Modelon
4.7 Overview of the FMUs in the balanced car aggregate.
5.1 Speedup graph of the Co-Simulation Race Car Model.
5.2 Speedup graph of the Co-Simulation Four Race Cars Model.
5.3 Speedup graph of the Co-Simulation Balanced Car Model.


List of Tables

3.1 An example of a schedule
3.2 Attributes of the task graph in figure 3.10
5.1 Average execution time of the Signal and Wait operations in milliseconds.
5.2 Average execution times in milliseconds for the Co-Simulation Race Car model with a step size of 2 ms.
5.3 Task graph weights for the Co-Simulation Race Car model.
5.4 Average execution times in milliseconds for the Model Exchange Race Car model with a step size of 2 ms.
5.5 Task graph weights for the Model Exchange Race Car model.
5.6 Average execution times in milliseconds for the Co-Simulation Four Race Cars model with a step size of 2 ms.
5.7 Task graph weights for the Co-Simulation Four Race Cars model.
5.8 Average execution times in milliseconds for the Co-Simulation Balanced Car model with a step size of 0.5 ms.
5.9 Task graph weights for the Co-Simulation Balanced Car model.
5.10 Utilization of the Co-Simulation Race Car model schedules with and without pinning a FMU to a specific core.
5.11 Runtimes of the Co-Simulation Race Car Model in seconds.
5.12 This table shows the total execution time for each FMU's step sequence in milliseconds.
5.13 Utilization of the Model Exchange Race Car model schedules with and without pinning a FMU to a specific core.
5.14 Runtimes of the Model Exchange Race Car Model in seconds.
5.15 Utilization of the Co-Simulation Four Race Cars model schedules with and without pinning a FMU to a specific core.
5.16 Runtimes of the Co-Simulation Four Race Cars Model in seconds.
5.17 This table shows the total execution time for each FMU's step sequence in milliseconds.
5.18 Utilization of the Co-Simulation Balanced Car model schedules with and without pinning a FMU to a specific core.
5.19 Runtimes of the Co-Simulation Balanced Car Model in seconds.
5.20 This table shows the total execution time for each FMU's step sequence in milliseconds.
5.21 Average runtimes from all test models per graph in milliseconds.


Listings

3.1 An example of a Step Sequence in an aggregate description with two FMUs.
3.2 Pseudo code for a simple semaphore implementation.
3.3 A small OpenMP example.
3.4 Pseudo code for the general list scheduling algorithm.
3.5 The general cluster scheduling algorithm.
4.1 An example of a step sequence in the aggregate description with two FMUs.
4.2 A divided step sequence without synchronization.
4.3 A divided step sequence with synchronization.
4.4 Pseudo code for the execution of the step sequence in parallel
4.5 Pseudo code to calculate the static b-level of a task graph.
4.6 Pseudo code for the Highest Level First With Estimated Times heuristic.
4.7 Pseudo code to calculate the critical path of a task graph.
4.8 Pseudo code to calculate the ALAP of a task graph.
4.9 Pseudo code for the Modified Critical Path heuristic.


1 Introduction

Modeling and simulation is a technique to represent physical systems as abstract models and perform experiments on them. It has become a useful part of many engineering processes as a step between concept design and prototype. Creating real prototypes can be an expensive and long process; it is often cheaper to create a digital model. The model can be used to verify and optimize the design before a prototype is built.

In 2010 the first version of the Functional Mockup Interface (FMI) standard was published. It defines a standard way to represent models. A model that follows the FMI standard is called a Functional Mockup Unit (FMU). In this thesis an aggregated FMU refers to an FMU that contains several coupled FMUs internally. The thesis implements and evaluates the use of static scheduling to execute the simulation of aggregated FMUs in parallel on symmetric multiprocessor systems.

1.1 Motivation

A big problem with performing simulation on digital models is that it can be computationally heavy. With complex models it can take several minutes to calculate one second of simulation time. For example, the model described in section 4.6 takes about 20 minutes to simulate 25 seconds. The faster the simulations can run, the less time is wasted, the less expensive the hardware they can run on, and the lower the power consumption. Therefore, it is important that the simulation tool vendors design the tools as efficiently as possible.

A big part of this is optimizing the program for modern hardware. The trend in computer architecture for the last decade has been to increase the number of cores in each CPU. This is because increasing the clock frequency of a single core has seen a lot of diminishing returns. The diminishing returns are due to something known as the three walls in computer architecture: the memory-, instruction level parallelism- (ILP) and power-wall. The memory-wall refers to the fact that memory speeds have not kept up with CPU performance. The ILP-wall is the increasing difficulty of finding enough parallelism in a single instruction stream to keep the CPU cores busy. The power-wall refers to the fact that a small increase in clock frequency can increase the power consumption a lot.

To fully utilize a modern CPU it is therefore necessary to design the simulation software to use multiple cores. This thesis will present a solution for how this can be achieved within the FMI standard using static scheduling techniques.


1.2 Aim

The aim of this thesis is twofold. The first part is to find and demonstrate a good way to implement static scheduling for aggregated FMUs. The second part is to evaluate whether static scheduling is a good method for simulating aggregated FMUs in parallel. This includes exploring what advantages and disadvantages the approach has and what kind of speedups can be expected.

1.3 Research questions

This thesis has two research questions it will strive to answer. The first question will be answered by showing an implementation of static scheduling for aggregated FMUs. The second question is to evaluate how well the implementation performs.

1. How can static scheduling be used to simulate an aggregated FMU in parallel?

2. How big of a speedup can be expected by executing an aggregated FMU in parallel?

1.4 Delimitations

This thesis will assume that the simulations are executed on a shared memory processor. It will also only discuss parallel simulation in the context of the FMI standard. It will not support models that have algebraic loops.


2 Background

This chapter introduces additional background information as a complement to the introduction.

2.1 Functional Mockup Interface

Functional Mockup Interface (FMI, www.fmi-standard.org) is a standard created by the ITEA2 (Information Technology for European Advancement) project MODELISAR. The goal was to support the AUTomotive Open System ARchitecture (AUTOSAR) and to develop the FMI standard. The purpose of FMI is to standardize model exchange and co-simulation of dynamic models, instead of every tool using its own solution. The standard specifies an interface to represent models. An instance of a model that follows this specification is called a Functional Mockup Unit (FMU). [1]

One common use case is that an OEM wants to simulate several models from different suppliers coupled together into one larger system. If each supplier uses different modeling tools to create their models, it can cause incompatibilities between the models and obstruct a successful simulation of the whole system. Hence it was necessary to create an open standard that all tools could follow, which became FMI.

The MODELISAR project began in July 2008 and ended in December 2011. The first version 1.0 of FMI was published in 2010 and the latest version 2.0 was published in 2014 [2]. Since the MODELISAR project has ended, FMI is now maintained and developed by the Modelica Association. [1]



3 Theory

This chapter introduces the necessary theory needed to understand the rest of the thesis. It will first introduce coupled systems from a general perspective, then it will go into the specific implementation of coupled systems within the FMI standard. After that it will introduce the concept of an aggregated FMU, which couples several regular FMUs and encapsulates them in a single FMU. The next section will deal with the necessary parallel computing theory to understand the parallelization parts of this thesis. Then the multiprocessor scheduling problem will be introduced, and the last section will contain some related work.

3.1 Coupled Systems

Simulating and modeling complex engineering systems is a difficult task that requires a lot of work. To simplify the workflow, it is often preferred to divide a large system into several smaller subsystems and couple them together later on. This modular approach makes it possible to model each subsystem independently of the others and in parallel. There are many advantages to this approach: for example, the subsystems can be reused with little work, modelers only need to focus on areas within their expertise, and the internal workings of each subsystem can be hidden.

Figure 3.1: The different model descriptions of two subsystems, coupled at the behavioral description.

However, the modular approach also comes with the problem of coupling the subsystems together efficiently without compromising the stability of the system. The coupling can be done at three different abstraction levels: at the physical, mathematical and behavioral model descriptions, see figure 3.1. In the physical model description, the system is modeled with physical parameters, such as mass, resistance, velocity etc. In the mathematical description the system is described by mathematical equations, and in the behavioral description the system is described by the simulation results from the mathematical equations. Coupling the subsystems at the physical description is basically the same as the non-modular approach.


Only the mathematical and behavioral model descriptions will be considered in this thesis. [3]

No matter at which abstraction level the system is coupled, some consideration has to be taken in how this is done. The output of a subsystem can be directly dependent on one or several inputs of the same subsystem. This is called Direct Feed-through.

Definition 3.1. Direct Feed-through is when an output of a system is directly dependent on an input. [4]

It means that the value of the input directly controls the value of the output. One example of direct feed-through is when the output is simply a constant added to an input, but if the output is delayed by one time step it is not a direct feed-through. In figure 3.2 the dashed lines denote direct feed-through; $y^{[1]}_1$ is for example directly dependent on $u^{[1]}_2$. This becomes a problem if the system contains a loop of direct feed-through connections. It is necessary to know the value of all directly dependent inputs to calculate the value of an output, which is not possible if the output is part of a direct feed-through loop. Such a loop is also known as an algebraic loop.

Definition 3.2. An algebraic loop is a loop of connections between the subsystems with directfeed-through.

Figure 3.2: Two coupled systems, the dashed lines are direct feed-through and the red lines show an algebraic loop. The outputs and inputs are denoted with y and u respectively.

See figure 3.2 for an example of an algebraic loop. There are ways to eliminate loops, for example by adding a filter that removes the direct feed-through property on one of the outputs in the loop [3]. There are also ways to numerically solve the loops, which in general is a non-trivial task. As stated in section 1.4, algebraic loops are not considered in this thesis.

Mathematical Description

A system coupled at the mathematical description is often called a strongly coupled system. This is only possible if the subsystems expose their internal equations. This section will introduce a mathematical description of a coupled system.

A subsystem i can be described by the differential algebraic equation (DAE)

$$\dot{x}^{[i]} = f^{[i]}(x^{[i]}, u^{[i]}, t) \qquad (3.1)$$

$$y^{[i]} = g^{[i]}(x^{[i]}, u^{[i]}, t), \qquad (3.2)$$

where $x^{[i]}$ is the vector of state variables, $u^{[i]}$ is the vector of inputs, $y^{[i]}$ is the vector of outputs and $t$ is time [3]. Note that the superscript $i$ denotes which subsystem the function or variable belongs to. To describe a global system of $N$ subsystems, the global input and output vectors are

$$y = \begin{bmatrix} y^{[1]} & y^{[2]} & \dots & y^{[N]} \end{bmatrix}^T \qquad (3.3)$$
$$u = \begin{bmatrix} u^{[1]} & u^{[2]} & \dots & u^{[N]} \end{bmatrix}^T. \qquad (3.4)$$

To describe the connections between the subsystems, each subsystem's input vector $u^{[i]}$ is defined as a function of the global output vector $y$:

$$u^{[i]} = L^{[i]} y = \begin{bmatrix} L^{[i]}_1 & \dots & L^{[i]}_{i-1} & 0 & L^{[i]}_{i+1} & \dots & L^{[i]}_N \end{bmatrix} \begin{bmatrix} y^{[1]} \\ \vdots \\ y^{[i-1]} \\ y^{[i]} \\ y^{[i+1]} \\ \vdots \\ y^{[N]} \end{bmatrix} \qquad (3.5)$$

where $N$ is the number of subsystems and $L^{[i]}$ is the coupling matrix, where each element is either 0 or 1 [3, 4]. Now the global system can be described by the vectors of all the subsystems' state variables $x$, their outputs $y$ and inputs $u$.

Simulating a strongly coupled system is equivalent to solving the initial value problem, i.e. solving ordinary differential equations given an initial state. There are several numerical methods to do this; one example is the Runge-Kutta method. [3]
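As a minimal illustration (not a method prescribed by the thesis), the simplest such integrator is the explicit Euler scheme, which advances the global state with a fixed step size $h$:
$$x_{n+1} = x_n + h\, f(x_n, u_n, t_n), \qquad t_{n+1} = t_n + h.$$
Runge-Kutta methods follow the same pattern but combine several evaluations of $f$ per step to reach a higher order of accuracy.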

Co-Simulation

Coupling at the behavioral description is most often referred to as weakly coupled systems, simulator coupling or co-simulation. This is when each subsystem uses its own integrator to solve the internal system and then communicates the result to the other subsystems. The result of this is that the communication between the subsystems is not continuous; it occurs at fixed discrete points, usually called global steps and denoted $T_n$. There is a need to distinguish between global steps $T_n$ and local steps $t^{[i]}_{n,m}$, where $T_n = t^{[i]}_{n,0} < t^{[i]}_{n,1} < \dots < t^{[i]}_{n,m} = T_{n+1}$. Since a subsystem only knows the value of its inputs at $t^{[i]}_{n,0}$ for each global step $n$, it raises a question of how to deal with the inputs. There are a couple of different ways of solving this problem. The easy answer is to keep the inputs constant between each global step. Another solution is to extrapolate the input from its previous values. In either case, it is worth noting that we might reduce the accuracy of the simulation and may introduce instability. [4]
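As a hedged illustration of the two input-handling strategies mentioned above (real tools may use higher polynomial orders), constant and linear extrapolation of an input over the interval $T_n \le t < T_{n+1}$ can be written as
$$u(t) = u(T_n) \quad \text{(constant)}, \qquad u(t) = u(T_n) + \frac{u(T_n) - u(T_{n-1})}{T_n - T_{n-1}}\,(t - T_n) \quad \text{(linear)}.$$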

Weakly coupled systems introduce many advantages over strongly coupled systems. The fact that all subsystems have their own integrator makes it possible to use a domain-specific integrator for each subsystem. More importantly for this thesis, since the systems are more divided, each subsystem can be simulated in parallel.

Simulation of a co-simulation system is done with something usually called a master algorithm. The master algorithm controls the communication between the different subsystems and the order of execution. There are two main types, the Jacobi type and the Gauß-Seidel type. When stepping the system forward one step from $T_i$ to $T_{i+1}$, every system $n$ has to know or make an assumption about its inputs $u^{[n]}(T_i)$ to calculate its outputs $y^{[n]}(T_{i+1})$, where $u^{[n]}(T_i)$ denotes the input vector of system $n$ at the time $T_i$.


Figure 3.3: A simple system.

With an algorithm of the Jacobi type, the input is calculated from the connected systems' previous outputs; for example, in figure 3.3 with constant interpolation, system 2 would use $u^{[2]}(T_{i+1}) = y^{[1]}(T_i)$ to calculate $y^{[2]}(T_{i+1})$. With a Gauß-Seidel type master algorithm the systems would instead use the actual input values, for example system 2 would use $u^{[2]}(T_{i+1}) = y^{[1]}(T_{i+1})$; however, this means that system 1 has to be solved before system 2 or 3 can know their inputs. Jacobi type master algorithms have a bigger potential for simulating the systems in parallel; however, there exist cases where the Gauß-Seidel type converges and the Jacobi type does not. [5]

3.2 Functional Mockup Interface

The Functional Mockup Interface (FMI) is a standard that describes a way to represent models, see section 2.1. A model that follows the standard is called a Functional Mockup Unit (FMU). An FMU is a ZIP archive that contains an XML file named model description, a set of C functions usually called the runtime, and other optional data such as icons, documentation files, libraries etc. The XML file contains information about the model, such as exposed variables, unit definitions, model structure and more. The set of C functions can either be in C code or binary format and it defines a standardized API to interact with the model. Figure 3.4 shows an example of an FMU's file structure. [1]

    Model.fmu
        Binaries
            Win32
                Model.dll
            Linux32
                Model.so
            ...
        Documentation
            index.html
        Icon.png
        ModelDescription.xml
        Resources
            ...

Figure 3.4: An example of an FMU.

Two of the most important exposed functions for this thesis are fmi2GetXXX(. . . ) and fmi2SetXXX(. . . ), where XXX denotes the variable type. These functions will from now on be referred to as Set and Get. The Set function sets the value of an FMU's input and Get retrieves the value of an output from an FMU.
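As an illustration of how these calls look in C (a sketch only: the header fmi2Functions.h and the function signatures come from the FMI 2.0 standard, but the value references 0 and 3 and the helper function are made up for this example, and in practice the functions are resolved from the FMU's shared library rather than linked statically):

    #include "fmi2Functions.h"

    /* Copy one real output of FMU a to one real input of FMU b.
     * Which value reference belongs to which variable is listed in each
     * FMU's modelDescription.xml; 0 and 3 are placeholders. */
    void copy_output_to_input(fmi2Component a, fmi2Component b)
    {
        const fmi2ValueReference y_ref = 0;  /* assumed output value reference */
        const fmi2ValueReference u_ref = 3;  /* assumed input value reference  */
        fmi2Real value;

        fmi2GetReal(a, &y_ref, 1, &value);   /* Get: read the output from a    */
        fmi2SetReal(b, &u_ref, 1, &value);   /* Set: write it to the input of b */
    }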


There are two different types of FMUs, FMU for model exchange (ME) and FMU for co-simulation (CS). The ME FMUs are intended for strongly coupled systems; they expose their derivatives so that an external tool can simulate the model. The CS FMUs are intended for weakly coupled systems; they include a solver in the FMU, see figure 3.5 for an overview of the difference. Both types of FMUs have some common features and restrictions which are important for this thesis. [1]

Figure 3.5: Two coupled systems, one with Co-simulation FMUs (a) and one with Model Exchange FMUs (b).

Feature 3.1. It is possible to get dependency information from an FMU about which outputs directly depend on which inputs.

Restriction 3.1. The FMI standard does not guarantee that an FMU's operations are thread safe.

Feature 3.1 is essential for being able to parallelize the simulation of FMUs. If the dependency information is not available, there is no way of knowing in which order we can execute the operations safely. On the other hand, restriction 3.1 limits how well the simulations can be parallelized. It means that it is not possible to execute two operations on the same instance of an FMU at the same time without risking invalid results.

FMU for Model Exchange

The FMUs for model exchange expose their internal equations to enable an external solver to simulate the system. This can be used to strongly couple several FMUs together. The FMU's interface handles ordinary differential equations (ODEs) with events, also known as hybrid ODEs. The model is described as a piecewise continuous-time system; discontinuities can occur and are called events. In between the events the variables are either continuous or constant. During the simulation an ME FMU can be in five different states, see figure 3.6 for the state machine. [6]

Instantiated: The FMU has just been loaded or reset. Variables that have an exact or approximate initial value can be set using the Set operations. To exit this mode fmi2EnterInitializationMode(. . . ) is called and the FMU enters the initialization state.

Initialization: During this state the initial values for inputs can be calculated using extra equations that are not accessible in other states. Outputs can also be retrieved using the Get operations. To exit this state fmi2ExitInitializationMode(. . . ) is called and the FMU enters the event mode.

Event Mode: In this state the events are processed; new values for all continuous-time variables and activated discrete-time variables are computed. When fmi2NewDiscreteStates(. . . ) is called the FMU will either enter the continuous-time mode or do another iteration of the event mode.

Continuous-Time Mode: In this state the external solver will integrate the ODEs to step the system forward. The integration is stopped when an event is triggered; the solver will check the event indicators after each completed integrator step. After an event is triggered the FMU will enter the event mode. All discrete-time variables are fixed during this state.

Terminated: The simulation has ended and the solution at the last time step can be retrieved.

There are four different types of events in an ME FMU.

External Event: An external event is triggered when a continuous-time input has a discontinuous change, a discrete-time input changes value or a parameter is changed.

Time Event: A time event is triggered at a predefined time step.

State Event: A state event is triggered if the FMU enters a specific state; this is done by setting event indicators.

Step Event: A step event can occur after an integrator step has finished; they typically do not influence the model behavior. They are used to ease the numerical integration.

Figure 3.6: The FMI Model Exchange FMU state machine [6]
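To make the continuous-time mode more concrete, the fragment below sketches one explicit Euler integration step of an ME FMU using the FMI 2.0 C API. It is only an illustration, not the solver used in the thesis: error control, event handling and the surrounding initialization are left out, and nx (the number of continuous states) is assumed to be known from the model description.

    #include "fmi2Functions.h"

    /* One explicit Euler step for an ME FMU that is in continuous-time mode.
     * x and dx are caller-provided arrays of length nx. */
    fmi2Status euler_step(fmi2Component fmu, size_t nx,
                          fmi2Real t, fmi2Real h, fmi2Real x[], fmi2Real dx[])
    {
        fmi2Boolean enterEventMode = fmi2False, terminate = fmi2False;
        size_t i;

        fmi2SetTime(fmu, t);
        fmi2GetContinuousStates(fmu, x, nx);   /* current state vector */
        fmi2GetDerivatives(fmu, dx, nx);       /* dx = f(x, u, t)      */

        for (i = 0; i < nx; i++)               /* forward Euler update */
            x[i] += h * dx[i];

        fmi2SetContinuousStates(fmu, x, nx);
        fmi2SetTime(fmu, t + h);

        /* Report the completed integrator step; the FMU may request event
         * mode (a step event), which a full solver would have to handle. */
        return fmi2CompletedIntegratorStep(fmu, fmi2True,
                                           &enterEventMode, &terminate);
    }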

FMU for Co-Simulation

The second type of FMU is co-simulation (CS); they are used in weakly coupled systems. In other words, the communication is only done at fixed discrete time points and each FMU is packaged with its own solver, so the FMUs can be solved independently of each other. The master algorithm steps each subsystem forward one step in time by calling fmi2DoStep(. . . ) on the FMUs; from here on fmi2DoStep(. . . ) will be shortened to DoStep. To control the communication between the FMUs, the Set and Get functions are used. During simulation a CS FMU has four different states; instantiated, initialization and terminated are basically the same as in the ME case, see the previous section 3.2. However, instead of the continuous-time and event modes it has an initialized mode. [6]

Initialized: This is where the actual simulation takes place. DoStep is used to calculate the state for the next global time step. Get and Set are used to pass values between models. The FMU can be in three different states, step complete, step failed or step canceled, depending on the return value of a DoStep call.

See the state machine in figure 3.7 for a more detailed description.

Figure 3.7: The FMI Co-Simulation state machine [6]
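As a sketch of what a Jacobi-type master algorithm looks like with these calls (the two value references and the single scalar connection are invented for the example; a real master reads the connections from the coupling information):

    #include "fmi2Functions.h"

    /* One Jacobi-type global step for two CS FMUs connected by a single
     * scalar signal from FMU 1's output (value reference 0, assumed) to
     * FMU 2's input (value reference 0, assumed). */
    fmi2Status master_step(fmi2Component fmu1, fmi2Component fmu2,
                           fmi2Real t, fmi2Real h)
    {
        fmi2ValueReference y_ref = 0, u_ref = 0;
        fmi2Real value;
        fmi2Status s1, s2;

        fmi2GetReal(fmu1, &y_ref, 1, &value);  /* y[1](t) from the previous step */
        fmi2SetReal(fmu2, &u_ref, 1, &value);  /* u[2](t) = y[1](t)              */

        /* The two DoStep calls only depend on already exchanged values,
         * so a Jacobi-type master could execute them in parallel. */
        s1 = fmi2DoStep(fmu1, t, h, fmi2True);
        s2 = fmi2DoStep(fmu2, t, h, fmi2True);

        return (s1 > s2) ? s1 : s2;            /* return the worst status */
    }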

3.3 Aggregated FMU

FMI is great for representing models, but the standard does not specify how the information describing the coupling of several FMUs should be stored. One way to solve this problem is by using aggregated FMUs. An aggregated FMU is a single FMU that contains several FMUs that are coupled together. This works by creating a new FMU for the aggregate and placing all the FMUs that are coupled together in its resource directory. Then an aggregate description XML file is added. This file specifies how the FMUs are coupled together. See figure 3.8 for an example. When the aggregate is used, a simulation tool will load the runtime (binaries) of the aggregate; the aggregate runtime will parse the aggregate description, load each FMU's binary files and run the simulation. It is important to note that the aggregate acts just like a regular FMU: any simulation tool that supports FMI is able to use an aggregated FMU.


    Aggregate.fmu
        Binaries
            Win32
                Aggregate.dll
        ModelDescription.xml
        Resources
            AggregateDescription.xml
            Model1
                Binaries
                    Win32
                        Model1.dll
                Resources
                    ...
            Model2
                Binaries
                    Win32
                        Model2.dll
                Resources
                    ...
            ...

Figure 3.8: An example of an aggregated FMU.

Aggregate for Co-Simulation

An aggregate for co-simulation is an FMU that consists of several CS FMUs. Remember that an aggregate looks like a regular FMU to the simulation tool. A DoStep call on the aggregate will need to simulate the entire coupled system one step forward. In other words, the aggregate runtime needs some kind of master algorithm to simulate the coupled system. The aggregate description contains call sequences that the aggregate will execute. Each call sequence contains operations that should be executed on a specific FMU. In the CS case an aggregate contains three call sequences.

Enter Initialization Sequence: This call sequence is executed when the FMUs are in the instantiated state. It will set variables with approximate or exact initial values and make sure all FMUs enter the initialization state.

Initialization Sequence: This call sequence is executed when the FMUs are in the initialization state. It will move the FMUs from the initialization state to the initialized state.

Step Sequence: This call sequence is executed when the FMUs are in the initialized state. It will move the aggregate forward one global step in time. This is done by using Get and Set operations to control the communication between the FMUs and DoStep to move each FMU in the aggregate forward in time.

By doing it this way the runtime is very flexible; for instance, both Jacobi and Gauß-Seidel master algorithms can be used by changing the step sequence. A simple example of a step sequence with two FMUs can be seen in listing 3.1.


    StepSequence:
        Set(u[1]_1)
        Set(u[1]_2)
        Set(u[2]_1)
        DoStep(1)
        DoStep(2)
        Get(y[1]_1)
        Get(y[2]_1)

Listing 3.1: An example of a Step Sequence in an aggregate description with two FMUs. Here u[i]_j and y[i]_j denote input and output j of FMU i.

Aggregate for Model Exchange

Coupling model exchange (ME) FMUs and saving them in an aggregate works in a similar way as in the CS case. The aggregate description defines call sequences that propagate the calls made on the aggregate to all FMUs in the aggregate. It has the same enter initialization and initialization sequences as the CS case, but it also has three other call sequences. Remember that in the model exchange case the solver is not included in the FMU; these call sequences are only for moving the FMUs between different states.

Continuous Sequence: Is executed when the FMU is in the continuous-time mode state. It updates all inputs that are continuous.

Event Sequence: Is executed when the FMU is in the event mode state.

New Discrete State Sequence: Is executed when fmi2NewDiscreteStates(. . . ) is called.

3.4 Parallel Computing Theory

This section will describe some basic parallel computing theory needed for this thesis.

Amdahl’s law

In parallel computing it is necessary to measure how effective a parallel algorithm is compared to the sequential algorithm. This is often done by measuring the speedup of the parallel algorithm. It is defined as the ratio between the sequential and parallel execution times.

Definition 3.3. Let $T_s$ define the execution time of the fastest sequential algorithm and $T_p$ the execution time of a parallel algorithm with $p$ processors. Speedup is then defined as
$$S = \frac{T_s}{T_p}.$$

In the ideal case the speedup would be equal to $p$, although in most practical cases it is not possible to reach an ideal speedup. This is because parallel algorithms add some overhead and it is usually only possible to run certain parts of the algorithms in parallel. It is possible to calculate the theoretical best possible speedup if the fraction of the algorithm that can be executed in parallel is known. This is known as Amdahl's law.

Definition 3.4. Let $p$ define the fraction of the algorithm that can be parallelized, $1-p$ the fraction that cannot, and $s$ the speedup of $p$. Then Amdahl's law states that the maximum possible speedup for the entire algorithm is
$$S(s) = \frac{1}{(1-p) + \frac{p}{s}}.$$
[7]


Figure 3.9: Graph showing the theoretical speedup $S(s)$ according to Amdahl's law, with curves for $p$ = 0.95, 0.90 and 0.80.

When programming for a specific architecture $s$ is usually fixed, which makes $1-p$ the limiting factor for the possible speedup. Even a small $1-p$ factor can be detrimental to the performance, see figure 3.9. It is however worth noting that these are simplifications; Amdahl's law does not, for example, take into account the added overhead of using parallel algorithms.
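As a purely illustrative number: with $p = 0.95$ and a speedup $s = 8$ of the parallel part, Amdahl's law gives
$$S(8) = \frac{1}{(1-0.95) + \frac{0.95}{8}} = \frac{1}{0.05 + 0.11875} \approx 5.9,$$
so even with 95 % of the work parallelized, eight cores give less than a six-fold speedup, and the limit as $s$ grows is $1/0.05 = 20$.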

Semaphore

In multi-threaded programs it is necessary to have some kind of synchronization between the threads. It can for instance be because of communication between the threads, code that only one thread can execute at once, or limitations in the algorithm. To accomplish this there are several synchronization primitives; one of these is the semaphore.

A semaphore is a synchronization primitive that is used to control how many threads can access a shared resource concurrently. It has an internal counter and two available operations, Wait and Signal. The Wait operation will put the thread to sleep if the counter is zero or less; if this is not the case the thread will decrement the counter and continue. The Signal operation will increment the counter and check if there are any threads waiting for the resource; if that is the case it will wake the first one up. See listing 3.2 for pseudo code.

    counter = initial_value

    function Wait()
        if counter is equal to 0
            add process to wait queue and sleep
        decrement counter

    function Signal()
        increment counter
        if wait queue is not empty
            wake up the first thread in the queue

Listing 3.2: Pseudo code for a simple semaphore implementation.

One easy way to think about semaphores is to imagine the counter as the number of available resources, for example the number of copies of a book in a library. The Wait operation is then equivalent to either taking a book or waiting until one is available, and the Signal operation to someone returning a book.
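As a small usage example of the same idea (a sketch using POSIX semaphores on a Unix-like system, built with for example gcc -pthread; the three initial "copies" and five threads are arbitrary):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t books;                /* counter = number of available copies */

    static void *reader(void *arg)
    {
        sem_wait(&books);              /* Wait: take a copy or sleep until one is free */
        printf("thread %ld borrowed a copy\n", (long)arg);
        sem_post(&books);              /* Signal: return the copy, wake a waiter */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[5];
        long i;

        sem_init(&books, 0, 3);        /* three copies available initially */
        for (i = 0; i < 5; i++)
            pthread_create(&threads[i], NULL, reader, (void *)i);
        for (i = 0; i < 5; i++)
            pthread_join(threads[i], NULL);
        sem_destroy(&books);
        return 0;
    }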


OpenMP

When developing a multi-threaded program it is often preferred to use a parallel framework. One of the de facto standard parallel frameworks for shared memory systems is OpenMP. OpenMP specifies a set of library routines, environment variables and compiler directives for C, C++ and Fortran [8].

OpenMP uses a fork-join model, i.e. the program is started with a master thread that forks (creates new) threads when it encounters an OpenMP directive. The library routines and environment variables can be used to control the runtime, e.g. how many threads to run and which schedules to use. For more information about OpenMP see the OpenMP specification (https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf); this section will only discuss the directive that is most relevant for this thesis, the Parallel For directive. [8]

The Parallel For directive is used to execute for loops in parallel and is of the form

    #pragma omp parallel for [clause [[,] clause] ...]
        for-loops

where a clause is an option. When a thread encounters the directive, it will create a set of threads and divide the iterations into chunks. How the chunks are divided and executed among the threads depends on which schedule is used. OpenMP supports three types of schedules: static, dynamic and guided. With static scheduling the iterations are divided into equal sized chunks and each thread is assigned chunks in a round robin fashion. With dynamic scheduling the iterations are also divided into equal sized chunks; however, each thread is only assigned one chunk and has to request a new chunk when it has finished its previous chunk. The guided schedule is similar to dynamic, but all chunks are not the same size; the chunks start out large and get smaller and smaller. [8]

    A = {1, 2, ..., 1000}
    #pragma omp parallel for num_threads(4) schedule(static, 20)
    for each elm in A
        elm = elm * 2
    end for

Listing 3.3: A small OpenMP example.

For a small example see listing 3.3. It doubles the value of all elements in an array using 4 threads and a static schedule with a chunk size of 20.
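A compilable C version of the same idea is sketched below (not taken from the thesis; built with, for example, gcc -fopenmp):

    #include <stdio.h>

    int main(void)
    {
        int a[1000];

        for (int i = 0; i < 1000; i++)   /* fill the array with 1, 2, ..., 1000 */
            a[i] = i + 1;

        /* Four threads, static schedule, chunks of 20 iterations each. */
        #pragma omp parallel for num_threads(4) schedule(static, 20)
        for (int i = 0; i < 1000; i++)
            a[i] = a[i] * 2;

        printf("a[0] = %d, a[999] = %d\n", a[0], a[999]);
        return 0;
    }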

Task Graphs

Task graphs are a way to represent computations and their dependencies. A task graph is a directed acyclic graph (DAG) where each node represents a task, denoted $n_i$, which is a set of instructions that must be executed sequentially.

Definition 3.5. A directed acyclic graph (DAG) $G$ is a set of nodes $V$ and directed edges $E$ that contains no directed cycles.

The edges represent dependencies between the computations and are denoted $(n_i, n_j)$ for an edge from node $n_i$ to $n_j$. Consider the simple task graph in figure 3.10: tasks B and C can be executed in parallel, but neither of them can be executed before task A has finished. Nodes that have a directed edge to another node are usually called the parents of that node.



Figure 3.10: A simple Task Graph. (Node weights: A = 2, B = 1, C = 2, D = 2; edge weights: (A, B) = 1, (A, C) = 1, (B, D) = 3, (C, D) = 1.)

For example, node A is a parent of nodes B and C. In the same manner, nodes B and C are called children of A. Nodes without any parents are called entry nodes and nodes without any children are called exit nodes. There is also a need to represent the computation time of a task and the communication costs of the edges.

Definition 3.6. The computation time of a task in a task graph is represented by the weight of the node $n_i$ and is denoted $w(n_i)$.

Definition 3.7. The communication cost between two nodes $n_i$, $n_j$ is represented by the weight of the edge between them and is denoted $c(n_i, n_j)$.

The numbers inside the nodes in figure 3.10 are the computation times and the numbers on the edges are the communication times. Another important aspect of task graphs is the critical path.

Definition 3.8. The critical path $cp$ is the path in a task graph $G$ with the largest sum of edge and node weights.

The critical path corresponds to the longest sequence of tasks that must be executed sequentially. There can be more than one if several paths have the same length. No matter how many processing cores are available, it is not possible to execute a task graph faster than it takes to execute its critical path. For example, in figure 3.10 the critical path is A → B → D with a length of 9.

3.5 Multiprocessor Scheduling

Finding an optimal schedule to execute a task graph on a number of processing cores is known as the multiprocessor scheduling problem. DAG scheduling algorithms (DSAs) can be divided into two different categories, static and dynamic. A static scheduling algorithm creates the schedule before the program is executed, and the runtime blindly follows that schedule. A dynamic scheduling algorithm schedules the tasks during execution; such algorithms are more flexible as they can measure the execution time of the tasks and adapt during runtime, but they are also more complicated and add overhead. In this thesis only static schedules are considered. The general multiprocessor scheduling problem is NP-complete and defined as follows [9].

Definition 3.9. Given a finite set $A$ of tasks, a length $l(a) \in \mathbb{Z}^+$ for each $a \in A$, a number of processors $m \in \mathbb{Z}^+$ and a deadline $D \in \mathbb{Z}^+$, finding a partition $A = A_1 \cup A_2 \cup \dots \cup A_m$ into $m$ disjoint sets such that
$$\max\Big(\sum_{a \in A_i} l(a) : 1 \le i \le m\Big) \le D$$
is known as the multiprocessor scheduling problem. [9]


For an example of a schedule, see table 3.1, which is an optimal schedule of the task graph in figure 3.10.

Table 3.1: An example of a schedule

    Step   Core 1   Core 2
    0      A        -
    1      A        -
    2      -        -
    3      B        C
    4      -        C
    5      -        -
    6      -        -
    7      D        -
    8      D        -

There exist some special cases of the problem that can be solved in polynomial time, but they are quite restrictive [10]. Most DSAs instead use heuristics to achieve near-optimal solutions in polynomial time. Kwok and Ahmad have done a thorough summary of 27 different DSAs that use heuristics [10]. The two most common types of DSAs are list and cluster scheduling algorithms.

List Scheduling

One of the simplest types of static task graph scheduling is list scheduling. There are many different list scheduling algorithms, but in the general case it involves two steps. The first is to sort the nodes according to some priority. The second is to loop through the sorted nodes and choose a processor to schedule each node on according to some strategy, see listing 3.4. [11]

    Sort nodes n ∈ V into a list L according to some priority
    for each n in L do
        Choose a processor P according to some strategy
        Schedule n on P
    end

Listing 3.4: Pseudo code for the general list scheduling algorithm.

The scheduling step is usually done in one of two different ways, insertion or non-insertion. In the non-insertion way, the task is appended to the schedule of the chosen processor. If insertion is used, the task can be inserted into a hole in the schedule of the processor. There are many different ways to calculate the node priority; two of the most common are Bottom-level and Top-level [10].

Definition 3.10. Top-level (t-level) of a node $n_i$ is defined as one of the longest paths from an entry node to $n_i$ [10].

Definition 3.11. Bottom-level (b-level) of a node $n_i$ is defined as one of the longest paths from $n_i$ to an exit node. Static b-level is a b-level that does not take into account the communication costs when calculating the longest path. [10]

Longest path refers to the path with the largest sum of node and edge weights. The t-level is highly correlated with the earliest possible start time of a node. The b-level is highly correlated with the critical path; the node with the highest b-level is by definition part of a critical path. Another attribute that some algorithms use is as late as possible (ALAP).


Definition 3.12. As late as possible (ALAP) of a node is how far the start time of the node can be delayed without changing the length of the schedule.

Table 3.2: Attributes of the task graph in figure 3.10

    Node   T-level   B-level   Static B-level   ALAP
    A      0         9         6                0
    B      3         6         3                3
    C      3         5         4                4
    D      7         2         2                7

See table 3.2 for the attributes corresponding to the graph in figure 3.10. A few examples of list scheduling algorithms are the following.

HLFET Highest level first with estimated times is one of the simplest list scheduling algorithms. The list is sorted by static b-level in descending order and starts with only the entry nodes. The nodes are scheduled on the processor that allows for the earliest start time; after a node has been scheduled all of its children are added to the list. The complexity of HLFET is $O(v^2)$ where $v$ denotes the number of nodes. [10]

ETF Earliest time first is quite similar to the HLFET algorithm. The list is also sorted by static b-level in descending order and starts with only the entry nodes. Then for each node in the list, the earliest start time is calculated for each processor. The node-processor pair with the lowest start time is chosen; ties are broken by static b-level. All the children of the chosen node are added to the list. The complexity of ETF is $O(pv^2)$ where $p$ denotes the number of processors. [12]

MCP Modified critical path works by first calculating the ALAP of all nodes. Then for each node a sorted list $l(n_i)$ is created; the node's and all its children's ALAP values are added to the list. All of the lists are then sorted in ascending order and a node list $L$ is created from that order. This list is then iterated over and each node is scheduled to the processor that allows the earliest start time, with insertion. MCP has a complexity of $O(v^2 \log v)$. [13]

Kwok and Ahmad have done a comprehensive benchmark of static scheduling algorithms. They used randomized graphs to test a suite of different DSAs. They compared each DSA with respect to the speedup it achieved, how effectively it used the cores and the running time for finding a solution. Of the list scheduling algorithms, the clear winner when taking all these measurements into account was MCP. However, the difference in speedup between the different list scheduling algorithms was very small; it was mostly MCP's low complexity that made it scale better than the other algorithms. ETF found shorter schedules than MCP in some cases, but its complexity of $O(pv^2)$ caused long running times. HLFET had similar speedups compared to ETF and MCP but it did not scale as well. [14]
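To make these attributes concrete, the small C program below encodes the task graph of figure 3.10 (using the node and edge weights recovered in its caption) and computes the b-level and static b-level of every node by walking the nodes in reverse topological order. It is only an illustrative sketch, not the scheduler implemented later in this thesis; the printed values match the B-level and Static B-level columns of table 3.2.

    #include <stdio.h>

    #define N 4  /* nodes A, B, C, D of figure 3.10 */

    static const int weight[N]  = {2, 1, 2, 2};  /* w(A), w(B), w(C), w(D)     */
    static const int comm[N][N] = {              /* c(ni, nj); 0 means no edge */
        {0, 1, 1, 0},                            /* A -> B (1), A -> C (1)     */
        {0, 0, 0, 3},                            /* B -> D (3)                 */
        {0, 0, 0, 1},                            /* C -> D (1)                 */
        {0, 0, 0, 0}
    };
    static const int rev_topo[N] = {3, 2, 1, 0}; /* D, C, B, A                 */

    int main(void)
    {
        int blevel[N], sblevel[N];
        int k, i, j;

        for (k = 0; k < N; k++) {
            i = rev_topo[k];
            blevel[i]  = weight[i];              /* exit nodes: just w(ni)     */
            sblevel[i] = weight[i];
            for (j = 0; j < N; j++) {
                if (comm[i][j] == 0)
                    continue;                    /* no edge i -> j             */
                if (weight[i] + comm[i][j] + blevel[j] > blevel[i])
                    blevel[i] = weight[i] + comm[i][j] + blevel[j];
                if (weight[i] + sblevel[j] > sblevel[i])
                    sblevel[i] = weight[i] + sblevel[j];  /* ignore c(ni, nj)  */
            }
        }
        for (i = 0; i < N; i++)
            printf("%c: b-level = %d, static b-level = %d\n",
                   'A' + i, blevel[i], sblevel[i]);
        return 0;
    }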

Cluster Scheduling

Another common static scheduling approach is cluster scheduling. In the general case it works by first creating a cluster for each node in the DAG. Then it does incremental improvements by merging clusters without increasing the total schedule length. It keeps merging clusters until it cannot find any more merges that do not increase the schedule length, see listing 3.5. Since the number of clusters can be larger than the number of available processors, a post-processing step of mapping the clusters to processors is also required. In fact, without the post-processing step this problem is not NP-complete; it is possible to find an unbounded set of clusters that gives an optimal schedule in polynomial time. However, any realization of the scheduling algorithm will need to map the clusters to a bounded set of processors. [11]

Create a set of clusters
Assign each node to a unique cluster
repeat
    Choose and merge clusters
    if schedule length increased after merge:
        Reject merge
until no valid merge can be found
Map clusters to processors

Listing 3.5: The general cluster scheduling algorithm.

One important concept of cluster scheduling is zeroing edges. If two nodes with a connection between them are scheduled to the same core the weight of their edge can be zeroed. This is because there is no communication cost between two operations on the same core. This gives an advantage over list scheduling algorithms such as HLFET and MCP. They can follow the critical path well by prioritizing nodes on ALAP and static b-level, but they do not take into account how the critical path changes if edges are zeroed. [11]

3.6 Related Work

Khaled et al. have written a paper about multi-core simulation using Co-Simulation FMUs [15]. They used the dependency information given by FMI to build a DAG where each node is an operation and each edge is a data dependency between two nodes. Then they used an offline scheduling heuristic, based on one created by Grandpierre and Sorel, that they call RCOSIM [16]. To solve the problem of FMI not being thread safe they scheduled all operations for an FMU on the same core. They also did a case study testing this algorithm on a Spark Ignition RENAULT F4RT engine divided into five FMUs. The model was tested on two Intel Xeon processors with 8 cores each running at 3.1 GHz. They got a 10.87 times speedup compared to running the entire model in a monolithic approach. However, they only got about a 1.4 times speedup compared to running the divided model on a single thread.

Another piece of related work is the paper "Acceleration of FMU Co-Simulation On Multi-core Architecture" by Saidi et al. [17]. In this paper they discuss and implement a parallel method to accelerate the simulation of CS FMUs in xMOD². They used the same method as Khaled et al. to build graphs and schedule them using RCOSIM [15]. In their first approach they used estimations of the execution times when scheduling the DAG. They were not satisfied with this, so they implemented a profiler to get more realistic execution time estimations before running the scheduling heuristic. They also tried to solve restriction 3.1 in two different ways. The first solution used a mutex lock for each FMU. The second approach modified the scheduling heuristic such that all operations for an FMU got scheduled to the same core. They tested their solution in xMOD with a simulation of a Spark Ignition RENAULT F4RT engine implemented as 5 coupled co-simulation FMUs. They got approximately a 1.3 times speedup with the mutex lock and a 2.4 times speedup with the second solution using 5 cores. They also tried it with more than 5 cores but the speedup barely changed, most likely due to the fact that they used 5 FMUs.

Saidi et al. have also written a paper on using the RCOSIM approach to parallelize the simulation of co-simulation FMUs [18]. They performed two different graph transformations on the task graph before scheduling it. The first is to allow for so-called multi-rate simulation; multi-rate simulation means that each FMU can run with a different step size.

² http://www.xmodsoftware.com/


The second transformation they performed was to solve the problem of FMI not guaranteeing thread safety: they added edges between operations on the same FMU so that only one operation can run at a time. They then compared RCOSIM and their own version of RCOSIM on a Spark Ignition RENAULT F4RT engine on an Intel Core i7 with eight cores at 2.7 GHz. Their modified approach achieved about a 2.9 times speedup while RCOSIM only achieved about 1.9 on four cores.


4 Method

This chapter describes the methodology used during this thesis. It first describes the workflow that was used during the implementation. Then it shows how the parallel solution was implemented. After that it describes how the MCP and HLFET heuristics were implemented. The following sections present the models used for evaluation and describe how the evaluation of the implementation was done. The last section discusses the technical limitations of this method.

4.1 Workflow

This section describes the workflow that was used during this thesis. It all starts out with an SSP file. SSP, or System Structure and Parameterization, is a companion standard to FMI. It is a format for describing the coupling and parameterization of several interconnected FMUs. To import the SSP files the backend of Modelon's product FMI Composer (FMIC) was used. It parses the SSP file and creates a new aggregate FMU from the FMUs and their coupling information described in the SSP file. To create the aggregate, it creates a model description and an aggregate description, includes the runtime, etc. To simulate the aggregated FMU, PyFMI¹ was used. PyFMI is a Python package for interacting with FMUs. See figure 4.1 for an overview of the workflow.

Figure 4.1: The workflow that was used during this thesis: an SSP file is imported by the backend (Java), which produces an aggregated FMU containing the runtime (C); the aggregated FMU is then simulated with PyFMI to produce the simulation results.

4.2 Implementation

This section will describe how the backend and runtime were modified to parallelize the simulation of aggregated FMUs using static scheduling. Looking at the workflow in figure 4.1 there were two possible locations where the static scheduling could have been implemented: in the runtime or in the backend. Implementing the static scheduling in the runtime would add overhead since, unless some type of caching was implemented, the scheduling would have been done every time the aggregated FMU was instantiated. Implementing the static scheduling in the backend would only result in overhead during the creation of an aggregated FMU, since the schedule can be saved in the aggregate. The downside of implementing the scheduling in the backend is that the number of cores available during the simulation is unknown.

¹ http://www.jmodelica.org/page/4924


This is because the computer that creates the aggregate might not be the same computer that runs the simulation. If it were implemented in the runtime, the scheduling algorithm could always schedule for the correct number of cores. It was decided to implement the scheduling in the backend, due to the reduced overhead and simpler runtime binaries.

Runtime

The runtime is the binary that is placed in the aggregated FMU. It will parse the aggregate description, load all FMUs and expose the FMI API to the simulation tools.

Figure 4.2: Task graph for a step sequence for two FMUs, with nodes u[1], u[2], Step[1], Step[2], y[1] and y[2].

Since the scheduling was implemented in the backend it was necessary to save the resulting schedule in the aggregated FMU. In the previous sequential solution the aggregate description contained several call sequences. It was decided to modify these call sequences to allow for a parallel schedule. In the old implementation the step sequence for the task graph in figure 4.5 looked like:

StepSequence:
    Set(u[1])
    Set(u[2])
    DoStep(1)
    DoStep(2)
    Get(y[1])
    Get(y[2])

Listing 4.1: An example of a step sequence in the aggregate description with two FMUs.

In order to parallelize this, it seemed natural to divide the call sequence into several call sequences, one for each core. Dividing the previous step sequence for a parallel solution using two cores would then result in the following.

StepSequence 1:
    Set(u[1])
    DoStep(1)
    Get(y[1])

StepSequence 2:
    Set(u[2])
    DoStep(2)
    Get(y[2])

Listing 4.2: A divided step sequence without synchronization.

This corresponds well to the output of a static scheduling algorithm; it contains the execution order for each core. However, it was also necessary to add synchronization between the different cores. Otherwise there is, for example, nothing that guarantees that the Set(u[1]) operation is executed before Get(y[2]), even though there is a dependency between them. To accomplish this, a semaphore structure was implemented using the Windows API. Two new operations were added to the call sequences, Signal and Wait. Both semaphore operations have a semaphore id as a parameter and the Signal operation also has an increment parameter.


The increment parameter is an integer that will be added to the semaphore counter when the Signal operation is called. Adding synchronization to the previous example would result in:

StepSequence 1:
    Set(u[1])
    Signal(id = 0, increment = 1)
    DoStep(1)
    Get(y[1])

StepSequence 2:
    Set(u[2])
    DoStep(2)
    Wait(id = 0)
    Get(y[2])

Listing 4.3: A divided step sequence with synchronization.

The Signal operation in step sequence 1 has the same id as the Wait operation in step sequence 2. This ensures that Set(u[1]) is finished before Get(y[2]) is executed. The increment parameter in the Signal is only set to one since there is only one corresponding Wait operation. If there had been two Wait operations with id = 0, the increment parameter would have been set to two. The addition of synchronization ensures that all dependencies in a task graph are enforced, as long as the scheduling algorithm inserts the semaphore operations correctly.
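The thesis does not list the runtime's semaphore code, but a minimal sketch of how Signal and Wait could be realized on top of the Windows API is shown below. The function names agg_signal and agg_wait, the fixed-size handle table and the initial counts are assumptions made for this illustration, not the actual runtime code.

#include <windows.h>
#include <limits.h>

#define MAX_SEMAPHORES 64

/* One Windows counting semaphore per semaphore id used in the call sequences. */
static HANDLE semaphores[MAX_SEMAPHORES];

static void init_semaphores(void)
{
    for (int i = 0; i < MAX_SEMAPHORES; ++i) {
        /* Initial count 0: every Wait blocks until a matching Signal arrives. */
        semaphores[i] = CreateSemaphore(NULL, 0, LONG_MAX, NULL);
    }
}

/* Signal(id, increment): add 'increment' to the counter, releasing that many waiters. */
static void agg_signal(int id, int increment)
{
    ReleaseSemaphore(semaphores[id], increment, NULL);
}

/* Wait(id): block until the counter is positive, then decrement it. */
static void agg_wait(int id)
{
    WaitForSingleObject(semaphores[id], INFINITE);
}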

Even though this example shows a step sequence, the same procedure was used for all the different call sequences. The big difference between the call sequences is how the task graphs are created; this is explained in the next section.

The next step was to execute the divided call sequences in parallel. To accomplish this it was necessary to either use native threads or a parallel framework. Using native threads usually gives the programmer more control, but this problem is quite coarse grained and there was no need for the finer granularity of control. It was therefore decided to use OpenMP, the de-facto standard for parallel programming in C; its coarser granularity fits this problem well. More specifically, OpenMP 2.0 was used. OpenMP 2.0 is 16 years old, but the Microsoft Visual Studio compiler does not support newer versions.

To execute the call sequences in parallel an exec_callseq(...) function was implemented. It propagates the operations in the call sequence to the correct FMU. When the sequential implementation was in place, the parallelization with OpenMP was trivial. It was enough to create a for loop over each call sequence and add an OpenMP directive. Because each call sequence should be executed on its own thread, the parallel for directive was used with a static schedule and a block size of one. Since the exec_callseq(...) function returns a status code it was also necessary to reduce the result to the highest status code. For a pseudo code example of the step sequence, see listing 4.4.

parse step_seqs from description
#pragma omp parallel for num_threads(size(step_seqs)) schedule(static, 1)
for each step_seq in step_seqs
    result = exec_callseq(step_seq, ...)
    reduce result
end for

Listing 4.4: Pseudo code for the execution of the step sequence in parallel
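In C, the loop in listing 4.4 could look roughly like the sketch below. This is an illustrative reconstruction rather than the actual runtime code; call_seq_t, exec_callseq and the status-code convention are assumptions. Since OpenMP 2.0 for C has no max reduction, the highest status code is reduced manually inside a critical section.

#include <omp.h>
#include <stdio.h>

typedef struct { int id; } call_seq_t;   /* assumed stand-in for a parsed call sequence */

/* Assumed stand-in for the runtime's exec_callseq(...); returns an FMI-style status code. */
static int exec_callseq(const call_seq_t *seq)
{
    printf("executing call sequence %d on thread %d\n", seq->id, omp_get_thread_num());
    return 0; /* OK */
}

static int run_step_sequences(call_seq_t *step_seqs, int n)
{
    int worst_status = 0;
    int i;
    /* One thread per call sequence; schedule(static, 1) hands each thread exactly one sequence. */
    #pragma omp parallel for num_threads(n) schedule(static, 1)
    for (i = 0; i < n; ++i) {
        int status = exec_callseq(&step_seqs[i]);
        /* Manual reduction to the highest (worst) status code. */
        #pragma omp critical
        {
            if (status > worst_status)
                worst_status = status;
        }
    }
    return worst_status;
}

int main(void)
{
    call_seq_t seqs[2] = { {1}, {2} };
    return run_step_sequences(seqs, 2);
}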

Backend

The backend imports an SSP file, creates an aggregated FMU and saves it. This is where the scheduling was done. The scheduling algorithm has three inputs: a converter, the number of cores to schedule for, and a task graph. The converter was used to convert the nodes to call


sequence operations that can then be exported to the aggregate description. The number of cores is given as a command line argument to the backend.

Figure 4.3: Two coupled FMUs with their inputs (u) and outputs (y); the connections between the FMUs are drawn as solid arrows and the internal dependencies within each FMU as dashed arrows.

The first step was to create the task graphs. In order to accomplish this, two different types of dependencies had to be taken into account: the connections between the FMUs and the internal dependencies in each FMU. The internal dependencies go from an input to an output on the same FMU; the connections go from an output to an input. In figure 4.3 the internal dependencies are depicted as dashed arrows and the connections as solid arrows. Feature 3.1 makes it possible to parse the internal dependencies from each FMU's model description. The connection dependencies were parsed from the SSP file.

Figure 4.4: An example task graph from the FMUs in figure 4.3, with one node for each included input and output variable.

When the variables, dependencies and connections were known it was possible to create the task graphs. One task graph was created for each call sequence. All call sequences except the step sequence were created in a very similar manner. The biggest difference between them was which variables were included when the graph was created. When the correct variables were chosen, creating the graph was trivial. For each variable a node was created, and edges corresponding to the internal dependencies and connections were added. See figure 4.4 for a simple example task graph of the aggregate in figure 4.3 where all inputs and outputs are included.

Enter Initialization: This call sequence accomplishes three different steps. The first step is to set all internal inputs that have a starting value; to do this it includes in the task graph all variables that are not constant and have an exact or approximate initial value. After this it has to call EnterInitializationMode() on all sub FMUs; however, this is not included in the task graph, instead it is added as a post processing step after the scheduling is done. After it has been called on each sub FMU the call sequence has to make sure each input connected to another FMU is updated, which it does by adding all internal outputs and their corresponding inputs to the task graph. See the state machine in figure 3.7 for more details on which variables are included.

Initialization: This call sequence is to make sure that all internal inputs and outputs are updated after any changes to the external outputs that the simulation tool might have done. To accomplish this the task graph includes all internal outputs and their corresponding inputs.


Continuous: This call is used when the aggregate is in the Continuous time mode state. The task graph includes all continuous outputs and their corresponding inputs.

Event: The event call sequence will be triggered when EnterEventMode() is called. Here the task graph contains both the discrete and continuous outputs and their corresponding inputs.

New Discrete State: This call sequence also updates all variables, both continuous and discrete. The difference between this and the Event sequence is that NewDiscreteState() is called on each sub FMU if any of their discrete inputs have changed. This is however not represented in the task graph; it is added in a post processing step after the scheduling has been completed.

Figure 4.5: The task graph for the step sequence of the FMUs in figure 4.3, with the input nodes feeding into each FMU's Step node and the output nodes following it.

The step sequence task graph had to be handled differently than all other call sequences. The step sequence was implemented using a Jacobi type master algorithm. All inputs are delayed one time step, which means that the step sequence can begin by setting all inputs to their corresponding output's value from the previous time step.

All inputs have to be set before DoStep is called on the same FMU, which means that every input had to have an edge to its FMU's DoStep node. Since the outputs can only be fetched after their FMU's DoStep operation has completed, there also had to be an edge between the DoStep node and all outputs on the same FMU. Because the internal dependencies go from an input to an output on the same FMU there was no reason to take them into account in this task graph; all inputs already had to be scheduled before any outputs of the same FMU could be.
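As an illustration of this step-sequence graph construction, the sketch below builds the input-to-DoStep and DoStep-to-output edges for one FMU. The types and names (node_t, add_edge, build_step_edges) are assumptions for this example and not the backend's actual Java classes.

#define MAX_CHILDREN 64

/* Illustrative task graph node (assumed names). */
typedef struct node {
    const char  *name;                    /* e.g. "u[1]", "DoStep 1", "y[1]" */
    int          fmu;                     /* index of the FMU the operation belongs to */
    long         weight;                  /* execution cost w(node) */
    struct node *children[MAX_CHILDREN];  /* outgoing edges */
    int          n_children;
} node_t;

static void add_edge(node_t *from, node_t *to)
{
    from->children[from->n_children++] = to;  /* bounds checking omitted in this sketch */
}

/* Step-sequence dependencies for one FMU: every input must be set before DoStep,
   and every output can only be fetched after DoStep has completed. */
static void build_step_edges(node_t **inputs, int n_in, node_t *do_step,
                             node_t **outputs, int n_out)
{
    for (int i = 0; i < n_in; ++i)
        add_edge(inputs[i], do_step);
    for (int i = 0; i < n_out; ++i)
        add_edge(do_step, outputs[i]);
}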

There should not have been any reason to take the connections into account either, since the threads are synchronized between each execution of a call sequence. But due to an implementation detail in the runtime it was necessary to add dependencies that are the reverse of the connections. This is because the Set and Get operations that share a connection used the same memory location to store intermediate results. The result of all this can be seen in figure 4.5; the edge from u[1]2 to y[2]2 is there because of their connection.

The last step before scheduling was adding weights to the task graphs. It is very difficult to estimate the weights: the execution time of an FMU operation depends on how the model is designed and how the runtime of the FMU is implemented, and these two factors are not known during the scheduling. To get accurate weights, a profiler could be used to measure the execution times. However, there was not enough time to implement an automatic profiler; instead each test model was manually profiled. This was done by creating an aggregate


scheduled with uniform weights; this aggregate was then simulated with a modified runtime that measured the execution time per operation.

When the average execution times of a model had been measured the weights were manually calculated. The operation with the lowest average execution time was assigned a weight of one, and all other weights were scaled from that. The execution times of the semaphore operations do not depend on the FMU and were only measured once. When the weights had been calculated it was time for the scheduling, which is described in the next section.

4.3 Heuristics

When the estimation of node and edge weights was done, the only part left was the scheduling. Two different list scheduling algorithms were implemented, HLFET and MCP.

Highest Level First with Estimated Times

The first heuristic to be implemented was HLFET. It follows the basic list scheduling pattern discussed in section 3.5 but with some added implementation details. The first step of HLFET is to create a list of nodes sorted by static b-level. To calculate the static b-level, a recursive function was used, see listing 4.5. It traverses the graph in a depth first manner and sets each exit node's static b-level to its own weight. All other nodes' static b-levels are set to their own weight plus the largest static b-level of their children.

function sb_level(node)
    node.sb_level = w(node)
    for each child in node.children
        node.sb_level = max(node.sb_level, sb_level(child) + w(node))
    end for
    return node.sb_level

for each node in entry_nodes
    sb_level(node)
end for

Listing 4.5: Pseudo code to calculate the static b-level of a task graph.

When the static b-level had been calculated it was time to implement the actual scheduling algorithm, see listing 4.6. First a list of nodes sorted by static b-level in descending order was created. Each node in this list was then iterated over. The first step for each node was choosing which core the node should be scheduled on; in HLFET this is done by iterating over each core and finding the one that allows the earliest possible start time. However, in this case there was one big and important exception. Due to restriction 3.1 it was not possible to execute several operations on the same FMU in parallel. To solve this, all operations on a specific FMU were scheduled onto the same core. This was done by keeping a map of FMUs to cores and checking if a node's FMU was already scheduled to a specific core, see line 9 in listing 4.6.

The next step after choosing a core was adding synchronization to the node's parents. Since the node cannot be executed before its parents, a Wait operation was added for each parent. However, if the parent was scheduled to the same core as the node it was not necessary to add it, since they are already guaranteed to be executed in order. But it was then necessary to decrement the increment parameter of the parent's corresponding Signal operation. After the Wait semaphores were added the node was converted to one or several operations and added to the core's schedule. If the node had any children, a Signal semaphore was also added after the operation with increment equal to the number of children.


In the final step all children of the node are updated and added to the sorted list. Since a child is not allowed to be scheduled before the node, its earliest start time is set to the sum of the node's end time and the edge weight. Since the list is sorted by static b-level it is guaranteed that all parents of a node will be scheduled before the node itself is scheduled.

1  calculate static b-level of all nodes
2  sorted_nodes = entry_nodes sorted on b-level
3  cores = list of empty cores
4  fmu_core_map = empty map
5
6  for each node in sorted_nodes
7      chosen_core = empty core
8
9      if node.fmu in fmu_core_map:
10         chosen_core = fmu_core_map.get(node.fmu)
11     else
12         // Find the earliest possible start time without insertion
13         for each core in cores
14             if core.available < chosen_core.available
15                 chosen_core = core
16             end if
17         end for
18         fmu_core_map.add(node.fmu, chosen_core)
19     end if
20     chosen_core.available = max(chosen_core.available, node.earliest) + w(node)
21
22     // Add wait semaphores for all parents scheduled on different cores
23     for each parent in node.parents
24         if parent.core != chosen_core
25             chosen_core.add(wait operation)
26         else
27             parent.semaphore.increment--
28         end if
29     end for
30
31     // Convert node to operation and add to schedule
32     chosen_core.add(convert(node))
33     chosen_core.add(signal operation(increment = size(node.children)))
34
35     for each child in node.children
36         child.earliest = max(child.earliest, chosen_core.available + c(node, child))
37         if child not in sorted_nodes
38             sorted_nodes.add(child)
39         end if
40     end for
41 end for

Listing 4.6: Pseudo code for the Highest Level First With Estimated Times heuristic.

Modified Critical Path

The second heuristic to be implemented was modified critical path (MCP). MCP is similar to HLFET but has a few important differences. The first step is to calculate the graph's critical path, since it is needed in order to calculate the ALAP of each node. To calculate the critical path a recursive function traverses the graph depth first. Each exit node returns its own weight; all other nodes return the largest critical path of their children plus their own node weight. This recursive function is called with the entry nodes as input. The largest value returned is the critical path, see listing 4.7 for pseudo code.


function critical_path(node)
    max = 0
    for each child in node.children
        max = max(critical_path(child) + c(node, child), max)
    end for
    return max + w(node)

path = 0
for each node in entry_nodes
    path = max(critical_path(node), path)
end for

Listing 4.7: Pseudo code to calculate the critical path of a task graph.

To calculate the ALAP of each node another recursive depth first function was used. This time the exit nodes set their ALAP to the critical path minus their own weight. All other nodes set their ALAP to the smallest value of a child's ALAP minus the corresponding edge weight, taken over all their children, minus their own weight. This recursive function is called on all entry nodes, see listing 4.8 for pseudo code.

function ALAP(node)
    min = critical_path
    for each child in node.children
        min = min(ALAP(child) - c(node, child), min)
    end for
    node.alap = min - w(node)
    return node.alap

for each node in entry_nodes
    ALAP(node)
end for

Listing 4.8: Pseudo code to calculate the ALAP of a task graph.

After the ALAP has been calculated for each node, a list of all nodes sorted by ALAP in ascending order was created. Ties were broken by the children's ALAP, i.e. the node that has the child with the lowest ALAP is chosen. If the nodes cannot be distinguished by their children the tie is broken randomly. The sorted list is then iterated over and the nodes get scheduled one by one. The first step for each node is choosing which core to schedule it on; this is quite similar to HLFET but with one big difference. MCP uses insertion to choose a core, which means that each core has to keep track of the holes in its schedule. To find the earliest start time, all cores and all of their holes that the node fits in are iterated over. Due to restriction 3.1 all operations on a specific FMU had to be scheduled onto the same core. To accomplish this, a map of FMUs to cores was used. If a node's FMU was already scheduled to some core, only that core and its holes were considered.

If no hole was chosen, the core with the earliest possible start time is chosen and a new hole is created if the node cannot start at that start time. A hole consists of a start time, an end time and an index pointing to where the hole is in the schedule. If a hole was chosen, the index is saved and the hole either decreases in size or is removed.
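As a small illustration, a hole could be represented by a record like the following; the name hole_t and the integer time type are assumptions for this sketch.

/* A hole in a core's schedule: it spans [start, end] in schedule time and
   remembers where in the core's operation list it sits. */
typedef struct {
    long start;  /* earliest time at which an operation placed in the hole can begin */
    long end;    /* time at which the hole closes */
    int  index;  /* position in the core's operation list where the hole is located */
} hole_t;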

The rest of the scheduling is the same as in HLFET, except that the operations are inserted at the saved index instead of last in the schedule. Wait semaphores were added for all parents and the node was converted to operations and inserted into the schedule. Then a Signal semaphore was added and all children's earliest start times were updated. See listing 4.9 for the pseudo code.


calculate ALAP of all nodes
sorted_nodes = all nodes sorted on ALAP
cores = list of empty cores
fmu_core_map = empty map

for each node in sorted_nodes
    chosen_core = empty core
    best_hole = none
    best = infinity

    if node.fmu in fmu_core_map:
        chosen_core = fmu_core_map.get(node.fmu)
        for each hole in chosen_core.holes
            if hole.size >= w(node) and hole.start >= node.earliest
                best_hole = hole
                break
            end if
        end for
    else
        // Find the earliest possible start time with insertion
        for each core in cores
            if core.available < best
                chosen_core = core
                best = chosen_core.available
                best_hole = none
            end if
            for each hole in core.holes
                if hole.size >= w(node) and hole.start < best and
                   hole.start >= node.earliest
                    best_hole = hole
                    best = hole.start
                    chosen_core = core
                end if
            end for
        end for
        fmu_core_map.add(node.fmu, chosen_core)
    end if

    if best_hole == none
        // Create hole
        if node.earliest > chosen_core.available
            chosen_core.holes.add(Hole(chosen_core.available, node.earliest - 1))
        end if
        chosen_core.available = max(chosen_core.available, node.earliest) + w(node)
        index = chosen_core.size()
    else
        index = best_hole.index
        if best_hole.size > w(node)
            chosen_core.holes.add(new hole with smaller size)
        end if
        chosen_core.holes.remove(best_hole)
    end if

    node.core = chosen_core

    // Add wait semaphores for all parents scheduled on different cores
    for each parent in node.parents
        if parent.core != chosen_core
            chosen_core.add(wait operation at index)
            index++
        else
            parent.semaphore.increment--
        end if
    end for

    // Convert node to operation and add to schedule
    chosen_core.add(convert(node), index)
    chosen_core.add(signal operation(increment = size(node.children)) at index)

    for each child in node.children
        child.earliest = max(child.earliest, chosen_core.available + c(node, child))
    end for
end for

Listing 4.9: Pseudo code for the Modified Critical Path heuristic.


4.4 Test Models

This section presents the test models Race Car, Four Race Cars and Balanced Car, which were used to evaluate the parallel implementation.

Race Car

The first test model is of a Race Car that has been modeled in Modelica using the commercial Vehicle Dynamics Library (VDL). The model has a virtual driver that tries to drive in a figure eight while increasing the speed. It can be used to optimize the car's tires and chassis to drive around the figure eight as fast as possible. It consists of five FMUs, one chassis and four wheels. The FMUs have 172 connections between them, see figure 4.6. The model is available both as a Co-Simulation and a Model Exchange aggregate.

Figure 4.6: Overview of the Race Car model, with the chassis connected to each of the four wheel FMUs through hubFrame, rimFrame, spinVelocity, f and t connections. Note that each wheel has direct feed through and that the connections are vector valued. © Modelon

Four Race Cars

The next model is four Race Car FMUs in a single aggregate without any connections between them. They are the same model as the Race Car described above, but each car is exported as one large FMU instead of being split into chassis and wheel FMUs. This model is only available as a Co-Simulation aggregate. Since this model contains four large, equal FMUs it should show a large speedup.

Balanced Car

The last model is also a car from the Vehicle Dynamics Library, but a different car model than the other two. It was split up into several FMUs that should have close to equal complexity. It contains four different FMUs: the Experiment, which controls the car; Wheel, which is the front wheels and suspensions; Wheel_S, which is the rear wheels and suspensions; and a Delay FMU to break up algebraic loops. See figure 4.7 for an overview. This model is only available as a Co-Simulation aggregate.

4.5 Evaluation

When the implementation was done it was time to evaluate it. This was done in three steps. The first step was to verify that the solution was correct. The second step was to test the static scheduling algorithms and the last step was to test the simulation performance of the solution. All steps were done on a Dell laptop running Windows 10 with an Intel Core i7-2720QM with a max clock frequency of 3.3 GHz, four physical cores and eight logical threads.


Figure 4.7: Overview of the FMUs in the balanced car aggregate.

Correctness verification

When the implementation was completed, it was necessary to verify the correctness of the solution. To accomplish this, the models described in section 4.4 were simulated once with the sequential implementation and once with each heuristic from the new implementation. The results of the simulations were saved to files and then compared for any differences.

Scheduling Test

The second test is the scheduling test, which was done to compare how well the heuristics performed. The test was done with all test models in section 4.4. This test did not involve any simulation; it only measures the performance of the static scheduling. To accomplish this, three different measurements were taken.

Scheduling Execution Time The execution time of the scheduling algorithm was measured to compare the complexities of the heuristics. It was measured per task graph and calculated as an arithmetic mean over all test models scheduled on one to four cores.

Utilization The next measurement was the utilization of the schedule. It was calculated as

\[
\frac{\sum_{i=0}^{C} SL(c_i)}{\max(SL(c_i)) \cdot C}
\]

where SL(ci) denotes the schedule length of core ci and C is the number of cores used. The measurement shows how well the heuristics used the available computation power and can be used to compare the heuristics to each other. Note that it does not include idle time due to holes in the schedule; it is only based on the schedule length. A small code sketch of this calculation is shown after this list.

Utilization without pinning This measurement is the same as the previous one except that restriction 3.1 is ignored. In other words, we do not schedule all operations on a specific FMU on the same core; we allow them to be scheduled on any core. It will not be possible to run a simulation with this schedule, but it is an interesting measurement for comparing the heuristics and for seeing how much the restriction limits the scheduling algorithms.
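The sketch below computes the utilization measurement defined above from a list of per-core schedule lengths; the function name and argument layout are assumptions made for this illustration.

/* Utilization = (sum of per-core schedule lengths) / (longest schedule length * number of cores). */
double utilization(const double *schedule_length, int n_cores)
{
    double sum = 0.0;
    double max = 0.0;
    for (int i = 0; i < n_cores; ++i) {
        sum += schedule_length[i];
        if (schedule_length[i] > max)
            max = schedule_length[i];
    }
    return sum / (max * n_cores);
}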

The measurements were taken three times for all test models in both the co-simulation and model exchange case, for two, three and four cores.


Simulation Test

The last, and arguably most important, test is measuring the speedup during simulation with the parallel implementation compared to the sequential solution. The execution time was measured for all test models in both the sequential implementation and in the parallel implementation from one to four cores. When possible, the step size of the simulation was also varied. All simulations were executed using PyFMI. In the model exchange case the default simulation settings in PyFMI were used. That is, PyFMI uses CVode as the solver, BDF as the linear multistep method, and Newton as the nonlinear solver. To reduce the impact of other processes running on the computer, each simulation was executed three times. The minimum of these three values was then used to calculate the speedup. The minimum value was used to discard all runs where some disturbance caused a longer than usual execution time. The execution time was measured from when the models had been instantiated until the simulation had ended.

In order to know whether the simulation speedup results are good or not, a rough theoretical estimate of the best possible speedup was calculated. It is difficult to calculate the portion of the program that must run sequentially. To simplify this, the estimate was only calculated for each step sequence iteration; with complex models this should be where the program spends most of its execution time. No theoretical estimate was calculated for the ME model. In the optimal case each FMU would have its own thread (due to restriction 3.1). This means that the FMU with the step sequence that has the longest execution time is the limiting factor, so the best possible running time for a parallel solution is equal to the execution time of this FMU's step sequence. In order to calculate the estimate, the total sequential execution time of all the FMUs' step sequences is divided by the best possible running time. This is basically what you get when you use Amdahl's law and let s go towards infinity. The values used to calculate this estimate are the same as those used to calculate the weights.

\[
\text{Best possible speedup} = \frac{\text{Total sequential execution time}}{\text{Best possible execution time}}
\]

4.6 Technical Limitations

There are a couple of technical limitations to the method used in this thesis. The first has to do with OpenMP's wait policy. By default, OpenMP uses an active wait policy, which means that the threads will actively wait for a short duration after they have finished their work. This can cause a crash when the runtime is unloaded, for example by the FreeLibrary Windows function. Instead it is recommended to use a passive wait policy, which can be done by setting the OpenMP environment variable OMP_WAIT_POLICY to PASSIVE.

The second technical limitation has to do with logging during the simulation. If several threads write to the log at the same time it can cause issues depending on how the logging function is implemented in the simulation tool. For example, in PyFMI it causes a deadlock, but it works fine with the FMU Compliance Checker².

The third and last limitation is a DLL dependency issue. Since the semaphores used in the runtime are implemented with Windows functions, an additional dependency was added. This dependency is not included in the aggregate. In other words, the aggregate requires that the vcomp.dll file is available on the system where the simulation is done.

² https://github.com/modelica-tools/FMUComplianceChecker


5 Results

This chapter introduces the results of this thesis. The chapter is divided into two sections, Weights and Evaluation. The Weights section shows the weights used for each test model. The Evaluation section presents the results from the evaluation of the implementation.

5.1 Weights

This section shows how the weights were chosen. The execution times that were used to calculate the edge weights can be seen in table 5.1. They were measured by simulating a Co-Simulation Race Car with a step size of 2 ms scheduled to two cores with HLFET. It is worth noting that the execution time might include some waiting time, hence these times are likely an overestimate.

Table 5.1: Average execution time of the Signal and Wait operations in milliseconds.

Operation  Average  Count
Signal     2.13     16598
Wait       82.9     16598
Combined   85.0     33196

Race Car

Table 5.2 shows the execution time of each operation for the Co-Simulation Race Car model after three seconds of simulation time. The step size used was 2 ms. The Chassis FMU's Set operation has the smallest value; therefore, its weight is set to one. The DoStep operations have much longer execution times than the Set and Get operations. See the resulting weights in table 5.3.

Table 5.2: Average execution times in milliseconds for the Co-Simulation Race Car model with a step size of 2 ms.

FMU      Set       Get       DoStep  Enter Init
Chassis  0.071e-2  0.731e-2  156.2   8.66
Wheel 1  0.243e-2  0.119e-2  3.94    0.472
Wheel 2  0.200e-2  0.153e-2  4.23    0.466
Wheel 3  0.194e-2  0.174e-2  5.25    0.440
Wheel 4  0.217e-2  0.160e-2  4.39    0.454


Table 5.3: Task graph weights for the Co-Simulation Race Car model.

FMU      Set  Get  DoStep  Edge
Chassis  1    10   220000  120000
Wheels   3    2    6000    120000
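As a worked example of the scaling described in chapter 4, the Chassis weights in table 5.3 follow directly from tables 5.1 and 5.2: the cheapest operation, the Chassis Set at 0.071e-2 ms (0.00071 ms), is given weight one, so the Chassis DoStep weight becomes 156.2/0.00071 ≈ 220000 and the edge weight becomes 85.0/0.00071 ≈ 120000.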

Table 5.4 shows the execution time of each operation for the Model Exchange Race Car model after three seconds of simulation time. The step size used was 2 ms. The Get operations have quite a bit longer execution times than the Set operations, but the edge weight, which comes from the 85 ms combined semaphore execution time, is several orders of magnitude larger than the largest node weight. The weights were calculated in the same way as for the Co-Simulation Race Car, see table 5.5.

Table 5.4: Average execution times in milliseconds for the Model Exchange Race Car model with a step size of 2 ms.

FMU      Set       Get      Enter Init  New Disc
Chassis  0.401e-3  0.00999  8.939       7.46
Wheel 1  0.404e-3  0.0137   0.389       0.471
Wheel 2  0.431e-3  0.0135   0.444       0.484
Wheel 3  0.420e-3  0.0134   0.434       0.473
Wheel 4  0.401e-3  0.0133   0.407       0.467

Table 5.5: Task graph weights for the Model Exchange Race Car model.

FMU      Set  Get  Edge
Chassis  1    25   212000
Wheels   1    34   212000

Four Race Cars

Table 5.6 shows the average execution times for each operation of the Co-Simulation Four Race Cars model. The times are in milliseconds and the step size during the simulation was 2 ms. Since this model does not have any connections between the FMUs, no Get operations were used. The DoStep operations are several orders of magnitude larger than the Set operations and more than twice as long as the semaphore operations. From these execution times the weights were calculated, shown in table 5.7.

Table 5.6: Average execution times in milliseconds for the Co-Simulation Four Race Cars model with a step size of 2 ms.

FMU         Set       Get  DoStep  Enter Init
Race Car 1  0.395e-3  -    203.1   20.2
Race Car 2  0.402e-3  -    202.8   21.0
Race Car 3  0.401e-3  -    202.6   20.5
Race Car 4  0.404e-3  -    204.4   20.1


Table 5.7: Task graph weights for the Co-Simulation Four Race Cars model.

FMU        Set  Get  DoStep  Edge
Race Cars  1    -    500000  200000

Balanced Car

See table 5.8 for the average execution times of the Co-Simulation Balanced Car model. The execution times were measured during a simulation with a step size of 0.5 ms. None of the FMU operations come close to the semaphore operations; the closest are the DoStep operations on the two Wheels FMUs. The Wheels Get operation has the smallest execution time and its weight is set to one. See table 5.9 for all weights.

Table 5.8: Average execution times in milliseconds for the Co-Simulation Balanced Car model with a step size of 0.5 ms.

FMU         Set    Get     DoStep  Enter Init
Wheels S    0.698  0.0842  3.712   9.436
Experiment  0.292  0.269   0.376   1.146
Wheels      0.652  0.0554  3.180   8.242
Delays      0.143  0.0700  1.593   0.315

Table 5.9: Task graph weights for the Co-Simulation Balanced Car model.

FMU         Set  Get  DoStep  Edge
Wheels S    13   2    67      1500
Experiment  5    5    7       1500
Wheels      12   1    57      1500
Delays      3    1    29      1500

5.2 Evaluation

This section will present the results of the evaluation.

Race Car

The first step in the evaluation was to measure the utilization of each generated schedule. See table 5.10 for the utilization results for the Co-Simulation Race Car model. With pinning, there is not a big difference between MCP and HLFET: MCP performs slightly better in the Initialization and Enter Initialization graphs but worse in the DoStep graph. With no pinning, most schedules are close to 100%, except for the HLFET DoStep graphs and the MCP DoStep graph scheduled to four cores.

Table 5.10: Utilization of the Co-Simulation Race Car model schedules with and without pinning an FMU to a specific core.

Schedule          Pinning              No Pinning
                  2      3      4      2      3      4
Enter Init HLFET  0.874  0.832  0.810  1      1      1
Enter Init MCP    0.876  0.834  0.813  1      1      1
Init HLFET        0.874  0.832  0.810  1      1      1
Init MCP          0.876  0.834  0.813  1      1      1
Step HLFET        0.786  0.698  0.653  0.868  0.761  0.704
Step MCP          0.713  0.617  0.565  1      1      0.791


Table 5.11: Runtimes of the Co-Simulation Race Car Model in seconds.

            Run 1   Run 2   Run 3   Minimum
Sequential  1372.0  1377.4  1372.7  1372.0
HLFET 1     1344.7  1341.7  1342.0  1341.7
HLFET 2     1197.3  1200.0  1198.4  1197.3
HLFET 3     1205.2  1199.8  1201.9  1199.8
HLFET 4     1197.3  1198.7  1198.3  1197.3
MCP 1       1338.1  1340.2  1337.0  1337.0
MCP 2       1270.5  1275.0  1273.3  1270.5
MCP 3       1228.2  1228.8  1227.6  1227.6
MCP 4       1244.7  1246.6  1245.3  1244.7

The second step was to measure the speedup of a simulation. The model was simulated for 25 seconds of simulation time with a step size of 2 ms. The resulting execution times are shown in table 5.11. The speedup was then calculated by dividing the sequential solution's execution time by all other execution times. See figure 5.1 for the resulting speedup graph. The highest measured speedup was about 15%.

Figure 5.1: Speedup graph of the Co-Simulation Race Car Model, showing speedup versus number of cores (one to four) for HLFET and MCP.

To calculate the best possible theoretical speedup of each step sequence iteration the execution times from table 5.2 were used. From this data each FMU's step sequence execution time was calculated. The results are shown in table 5.12. From this data we can calculate the theoretical speedup of a step sequence iteration.

\[
\frac{176.5}{158.1} \approx 1.1 \tag{5.1}
\]

Table 5.12: This table shows the total execution time for each FMU's step sequence in milliseconds.

                Chassis  Wheel 1  Wheel 2  Wheel 3  Wheel 4  Sum
Execution Time  158.1    4.1      4.4      5.4      4.5      176.5

The utilization results for the Model Exchange Race Car model can be seen in table 5.13. With pinning, almost all schedules got close to 100% utilization; the only cases that did not were the Enter Initialization and Initialization graphs for three and four cores. Without pinning, all schedules reached 100% utilization.

To measure the speedup of the Model Exchange Race Car model it was simulated for 25 seconds of simulation time with a step size of 2 ms. The resulting execution times can be seen in table 5.14. The sequential solution performed a lot worse than all the heuristics, even when they


Table 5.13: Utilization of the Model Exchange Race Car model schedules with and without pinning an FMU to a specific core.

Schedule          Pinning            No Pinning
                  2  3      4        2  3  4
Enter Init HLFET  1  1      0.812    1  1  1
Enter Init MCP    1  0.843  0.813    1  1  1
Init HLFET        1  1      0.812    1  1  1
Init MCP          1  0.834  0.813    1  1  1
Cont HLFET        1  1      1        1  1  1
Cont MCP          1  1      1        1  1  1
Event HLFET       1  1      1        1  1  1
Event MCP         1  1      1        1  1  1
New Disc HLFET    1  1      1        1  1  1
New Disc MCP      1  1      1        1  1  1

were only scheduled for one core. Because of this the speedup was not measured for this model.

Table 5.14: Runtimes of the Model Exchange Race Car Model in seconds.

            Run 1   Run 2   Run 3   Minimum
Sequential  4043.3  4030.4  4029.7  4029.7
HLFET 1     2018.7  2028.0  2031.0  2018.7
HLFET 2     2015.9  2029.1  2019.2  2015.9
HLFET 3     2035.1  2054.6  2054.4  2035.1
HLFET 4     2037.9  2055.7  2065.1  2037.9
MCP 1       1944.0  1943.9  1964.9  1943.9
MCP 2       2068.0  2083.6  2076.5  2068.0
MCP 3       2028.4  2038.4  2040.7  2028.4
MCP 4       1998.3  2022.9  2001.6  1998.3

Four Race Cars

The utilization results for the Co-Simulation Four Race Cars model can be seen in table 5.15. There is no noticeable difference between the MCP and HLFET heuristics.

Table 5.15: Utilization of the Co-Simulation Four Race Cars model schedules with and without pinning an FMU to a specific core.

Schedule          Pinning         No Pinning
                  2  3      4     2  3      4
Enter Init HLFET  1  0.667  1     1  1      1
Enter Init MCP    1  0.667  1     1  1      1
Init HLFET        1  0.667  1     1  1      1
Init MCP          1  0.667  1     1  1      1
Step HLFET        1  0.667  1     1  0.667  1
Step MCP          1  0.667  1     1  0.667  1

The execution times were measured during a simulation of 25 seconds of simulation time with a step size of 2 ms. The resulting execution times are shown in table 5.16. MCP and HLFET performed very similarly. The speedups are shown in the graph in figure 5.2. The highest achieved speedup was about 2.8.


Table 5.16: Runtimes of the Co-Simulation Four Race Cars Model in seconds.

            Run 1   Run 2   Run 3   Minimum
Sequential  5985.9  6002.5  6037.8  5985.9
HLFET 1     5850.0  5878.2  5875.9  5850.0
HLFET 2     3190.0  3181.1  3197.7  3181.1
HLFET 3     3183.8  3205.1  3195.4  3183.8
HLFET 4     2098.9  2121.2  2126.2  2098.9
MCP 1       5860.4  5869.8  5866.3  5860.4
MCP 2       3182.5  3178.0  3181.5  3178.0
MCP 3       3183.0  3193.1  3192.5  3183.0
MCP 4       2103.8  2119.4  2125.5  2103.8

Figure 5.2: Speedup graph of the Co-Simulation Four Race Cars Model, showing speedup versus number of cores (one to four) for HLFET and MCP.

To calculate the best possible theoretical speedup of each step sequence iteration the execution times from table 5.6 were used. From this data each FMU's step sequence execution time was calculated. The results are shown in table 5.17. From this data we can calculate the theoretical speedup of a step sequence iteration.

\[
\frac{813.8}{204.5} \approx 4.0 \tag{5.2}
\]

Table 5.17: This table shows the total execution time for each FMU's step sequence in milliseconds.

                Race Car 1  Race Car 2  Race Car 3  Race Car 4  Sum
Execution Time  203.2       202.9       202.7       204.5       813.8

Balanced Car

The utilization results for the Co-Simulation Balanced Car model can be seen in table 5.18. Which heuristic got better utilization varies a lot from schedule to schedule.

To measure the speedup, the model was simulated for 25 seconds of simulation time with a step size of 0.5 ms. The measured execution times are shown in table 5.19. MCP performed a lot better than HLFET when scheduled to three cores; otherwise they were very close. The resulting speedups are shown in figure 5.3.

To calculate the best possible theoretical speedup of each step sequence iteration the execution times from table 5.8 were used. From this data each FMU's step sequence execution time was


Table 5.18: Utilization of the Co-Simulation Balanced Car model schedules with and without pinning an FMU to a specific core.

Schedule          Pinning              No Pinning
                  2      3      4      2      3      4
Enter Init HLFET  0.969  0.908  0.747  1      1      1
Enter Init MCP    0.977  0.837  0.696  1      1      1
Init HLFET        0.969  0.908  0.747  1      1      1
Init MCP          0.977  0.837  0.793  1      1      1
Step HLFET        0.905  0.935  0.769  1      0.963  0.915
Step MCP          0.984  0.931  0.769  0.991  0.946  0.929

Table 5.19: Runtimes of the Co-Simulation Balanced Car Model in seconds.

            Run 1   Run 2   Run 3   Minimum
Sequential  1005.2  1000.9  994.0   994.0
HLFET 1     928.2   923.5   927.0   923.5
HLFET 2     586.3   589.7   584.0   584.0
HLFET 3     562.9   565.0   561.1   561.1
HLFET 4     388.5   389.1   390.9   388.5
MCP 1       931.4   928.9   920.5   920.5
MCP 2       611.5   608.4   612.5   608.4
MCP 3       386.1   387.2   389.6   386.1
MCP 4       382.9   388.5   384.7   382.9

Figure 5.3: Speedup graph of the Co-Simulation Balanced Car Model, showing speedup versus number of cores (one to four) for HLFET and MCP.

calculated. The results are shown in table 5.20. From this data we can calculate the theoretical speedup of a step sequence iteration.

\[
\frac{28.6}{9.8} \approx 2.9 \tag{5.3}
\]

Table 5.20: This table shows the total execution time for each FMU's step sequence in milliseconds.

                Wheels S  Experiment  Wheels  Delays  Sum
Execution Time  9.9       6.5         8.8     3.2     28.6


Runtimes

The average execution times of the scheduling algorithms are shown in table 5.21. The times are an average over all test models scheduled to one to four cores. MCP is slower for all task graphs, sometimes by a factor of almost two.

Table 5.21: Average runtimes from all test models per graph in milliseconds.

       Enter Init  Init   Step   Cont  New Disc  Event
HLFET  53.70       27.86  0.472  1.30  2.07      1.25
MCP    78.879      54.55  0.850  1.83  2.83      1.97

Correctness Verification

Each test model was simulated and the results were compared to those of the sequential code. There was no difference in the results.


6 Discussion

This chapter discusses the thesis. It first discusses the results, then the method used during the thesis, and lastly it puts the work in a wider context.

6.1 Results

This section discusses the results presented in the previous chapter. The section is divided into one subsection for each model, but first, let us discuss the parts that are common to all models. Looking at the execution times for the semaphore operations in table 5.1, they seem quite high. The Wait operation takes over 80 ms on average, which seems far too long to be caused by synchronization overhead alone. The problem is that since it was measured during a simulation, the measurement includes the time spent waiting for the other thread to complete its work. A better solution would be to measure the semaphore operations when no other work is being done.

Another odd aspect of the data is that the sequential solution is slower than the one-threaded parallel solutions for all models. For the co-simulation models the difference is not that big, see tables 5.11, 5.16 and 5.19. However, for the Model Exchange Race Car model the difference is huge (see table 5.14); the sequential execution time is almost twice as long. At first it seemed like something must be wrong with the parallel implementation, but the results of the simulations were the same. The difference seems to come from the fact that the order of operations does matter for the performance of the simulation [19]. This was tested by using the results from the scheduling algorithms for one core as input to the sequential runtime implementation. This produced almost identical execution times to the parallel algorithm with the same schedule.

Co-Simulation Race Car

Looking at the weights in table 5.3 we see that the FMUs in the Race Car are unbalanced. The Chassis FMU's DoStep operation is several orders of magnitude more expensive than all other operations except the semaphore operations. This is a bad sign for the possible speedup with the parallel solution: since all operations on the Chassis FMU have to be scheduled to the same core we get an unbalanced schedule.

This is confirmed when looking at the utilization data in table 5.10. The important schedule for simulation speedup is the DoStep schedule, which has quite low utilization. It is worth noting here that HLFET achieves better utilization than MCP and consequently also has a better speedup. However, looking at the utilization data without restriction 3.1 we see that MCP performs a lot better than HLFET. This is probably because it uses insertion. It shows


that the restriction has a large impact on performance with this model. Going from 61% utilization on two cores to 100% would probably increase the speedup a lot.

Looking at the speedup graph in figure 5.1 we see very poor speedups. The highest speedup was 14.6% and was achieved with two cores. The lower speedup with three and four cores is most likely due to the increased overhead of additional threads and semaphores. Comparing the speedup to the theoretical estimate in equation 5.1 we see something odd: the actual speedup is larger than the theoretical estimate. This shows how bad the manual profiling is at guessing the execution times. It only measures the execution time during a short simulation, and the execution times might very well change at different points of the simulation.

Model Exchange Race Car

Looking at the weights and execution times in tables 5.5 and 5.4 for the Model Exchange Race Car we see that the operations have quite low execution times and that the edge weights are several orders of magnitude larger than the largest node weight. This is bad for the possible speedup, because the communication costs are zero when all operations are scheduled to one core.

Looking at the utilization data in table 5.13 we see that it is close to 100% for almost all schedules. This is a good sign for the possible speedup. However, looking at the execution times of the simulations in table 5.14 we notice two things. The first is that the sequential solution is a lot slower than both heuristics scheduled to one core, as discussed earlier. The second is that there is almost no speedup for the simulation; the best execution time is MCP scheduled to one core. This is not all that surprising, since at this abstraction level there is not a lot that can be parallelized in the model exchange case. To parallelize model exchange it is probably a better idea to use a parallel solver or to parallelize the implementation at a different abstraction level.

Four Race Cars

Looking at the weights of the Four Race Cars co-simulation model in table 5.7 we see that we have four equal FMUs with quite a long DoStep execution time. This should give very good parallel performance. Looking at the utilization in table 5.15 we see that the schedules with three cores have a low utilization and will probably have a speedup close to that of the two-core schedules. This is due to the fact that an FMU must be scheduled to a specific core; with four FMUs and three cores, two FMUs must end up on the same core.

This is confirmed when looking at the execution times in table 5.16. MCP and HLFET perform equivalently, and both get close to optimal performance when scheduled to two cores. On three cores we get almost the same speedup as for two cores, because in this schedule two of the four race cars are scheduled to the same core. On four cores we still get a good speedup, but we do not get anywhere near as close to optimal performance as with two cores. This is a sign that the synchronization overhead is large; to get closer to a speedup of four we would need less communication and/or FMUs with longer execution times.

Balanced Car

Looking at the weights for the Balanced Car co-simulation model in table 5.9 we see that it is quite well balanced. Both wheel FMUs have quite similar weights. We also note that the edge weights are a lot larger than the other weights. Looking at the utilization in table 5.18 we see very high numbers for two and three cores. The schedules for four cores are quite a bit lower; this is probably because the experiment FMU has low weights compared to the other FMUs.


Looking at the execution times in table 5.19 and the speedup in figure 5.3 we get very good results for two and three cores. For four cores we get the same speedup as for three, which agrees well with the utilization data. However, one interesting point is that HLFET performs a lot worse than MCP on three cores even though their utilization is almost the same. This is probably because HLFET scheduled two of the heavier FMUs to the same core and let the experiment FMU have a core of its own. This shows one of the weaknesses of this approach: we pin an FMU to a specific core according to where its first operation happens to be scheduled. An improvement could be to add a preprocessing step before the scheduling algorithm that estimates the weight of each FMU and divides the FMUs onto the available cores. This would be similar to the well-known bin packing problem.
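
A minimal sketch of such a preprocessing step is shown below. The function and type names are hypothetical and not part of the implementation in this thesis; the sketch simply applies the classic longest-processing-time-first greedy rule, assigning the heaviest remaining FMU to the currently least loaded core.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct FmuWeight {
    std::string name;  // FMU instance name
    double weight;     // estimated total cost of the FMU's operations per step
};

// Returns assignment[i] = core index for fmus[i].
std::vector<int> assignFmusToCores(const std::vector<FmuWeight>& fmus, int cores) {
    // Visit the FMUs in order of decreasing weight, heaviest first.
    std::vector<std::size_t> order(fmus.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return fmus[a].weight > fmus[b].weight;
    });

    // Min-heap over (accumulated load, core index): always pick the least loaded core.
    using Core = std::pair<double, int>;
    std::priority_queue<Core, std::vector<Core>, std::greater<Core>> load;
    for (int c = 0; c < cores; ++c) load.push({0.0, c});

    std::vector<int> assignment(fmus.size());
    for (std::size_t idx : order) {
        auto [currentLoad, core] = load.top();
        load.pop();
        assignment[idx] = core;
        load.push({currentLoad + fmus[idx].weight, core});
    }
    return assignment;
}

The resulting assignment would then be handed to the scheduling heuristic, so that restriction 3.1 is satisfied by construction instead of being decided implicitly by where each FMU's first operation happens to be scheduled.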

6.2 Method

Static scheduling can provide good speedups for coupled co-simulation FMUs. It is, however, very dependent on the model. For the best possible speedup, the FMUs in the aggregate need to be divided onto the fixed number of cores in a balanced manner, because each FMU is locked to a specific core due to restriction 3.1.

This method showed both worse and better speedups compared to the similar approach discussed in the related work chapter. However, it is very difficult to compare this approach to theirs since we do not have access to their test model.

It is very difficult to estimate the weights manually, and a profiler should be used for the best performance. It would be interesting to see how well a dynamic scheduling approach could perform; it would add some overhead during execution, but it would be able to measure the execution times better than a profiler. It is also worth noting that the manual profiling used in this thesis did not measure execution times per call sequence. Since an operation in the initialization phase can differ from the same operation in the initialized state, it might be possible to get slightly better weights.
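
A minimal sketch of such phase-aware profiling is given below. The class and the phase names are hypothetical and only illustrate the idea of booking each operation's execution time separately per simulation phase, so that for example a Get call during initialization gets a weight of its own.

#include <chrono>
#include <map>
#include <string>
#include <utility>

enum class Phase { Initialization, Initialized };

struct Stats {
    double totalSeconds = 0.0;
    long calls = 0;
};

class PhaseProfiler {
public:
    // Times fn() and books the result under the key (operation, phase).
    template <typename Fn>
    void measure(const std::string& operation, Phase phase, Fn&& fn) {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto stop = std::chrono::steady_clock::now();
        Stats& s = stats_[{operation, phase}];
        s.totalSeconds += std::chrono::duration<double>(stop - start).count();
        ++s.calls;
    }

    // Average execution time, usable as the node weight for (operation, phase).
    double averageWeight(const std::string& operation, Phase phase) const {
        auto it = stats_.find({operation, phase});
        if (it == stats_.end() || it->second.calls == 0) return 0.0;
        return it->second.totalSeconds / it->second.calls;
    }

private:
    std::map<std::pair<std::string, Phase>, Stats> stats_;
};

With something like profiler.measure("chassis.DoStep", Phase::Initialized, ...) wrapped around each FMI call, the scheduler could be fed one weight per operation and phase instead of a single averaged weight.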

One of the biggest weaknesses of this thesis is the lack of good test models. It would have been interesting to test the implementation on more models. All the models in this thesis are quite similar; it would have been better to also have, for instance, models with more than four coupled FMUs. The Four Race Cars model is also very unrealistic, but it is a good test in the sense that it shows the best possible speedups.

6.3 The work in a wider context

There are no direct ethical or societal impacts of this thesis. It could be argued that it has indirect impacts; for example, fast simulations can help engineers perform more tests on complex models, which in turn might improve the end product. Improving end products can have huge consequences, for instance more efficient engines with less pollution. However, this argument is rather far-fetched. Another more realistic but less impactful argument is that faster simulations save power and might reduce how much computer hardware is necessary.


7 Conclusion

This chapter concludes the thesis. It contains a summary and suggestions for future work.

7.1 Summary

Static scheduling with two different heuristics has been implemented and tested on three different models. The implementation has shown large speedups for some co-simulation models. Due to restriction 3.1 it is, however, very dependent on the model used: for the best possible performance it must be possible to divide the FMUs onto the cores in a balanced manner. It is also important that each step has a long execution time, otherwise the overhead might become too large.

With a model exchange aggregate no speedups were observed. This was, however, only tested on one model and can therefore not be considered conclusive. It might be possible to achieve small speedups if the Set and Get operations on the FMUs have long execution times.

Both heuristics performed similarly; HLFET performed slightly better on most of the co-simulation schedules. This, combined with its shorter runtime, makes HLFET the preferred choice. It did, however, perform very poorly for the Balanced Car test model with three cores. A cluster scheduling algorithm that uses the edge-zeroing concept might perform better than HLFET and MCP, since the edge weights are very large.

7.2 Future Work

This section presents future work.

Other Heuristics It would be interesting to see how other heuristics perform compared to HLFET and MCP. The fact that the edge weights usually are very large compared to the node weights suggests that a heuristic that uses edge zeroing might perform well; a minimal sketch of the idea is given after this list. One example would be dynamic critical path (DCP), which was the best performing algorithm in the benchmark by Kwok and Ahmad [14]. It could also be interesting to test more advanced algorithms, such as genetic or randomized algorithms.

Dynamic Scheduling It would be interesting to compare how static scheduling performs against a dynamic scheduling implementation. The fact that the weights are so difficult to estimate might make a dynamic scheduling implementation perform better than static scheduling.
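
The following is a very simplified sketch of the edge-zeroing idea mentioned above under Other Heuristics. It is neither DCP nor the implementation in this thesis: the edges are visited in order of decreasing communication cost and the endpoint clusters are merged, so that the most expensive edges become cluster-internal and their cost drops to zero. The balance bound is a crude stand-in for the schedule-length test that a real clustering algorithm would perform, and all names are hypothetical.

#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int from; int to; double cost; };

// Simple union-find over task indices; each set is one cluster.
struct Clusters {
    std::vector<int> parent;
    explicit Clusters(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

Clusters zeroEdges(int numTasks, std::vector<Edge> edges,
                   const std::vector<double>& nodeWeight, int cores) {
    Clusters clusters(numTasks);
    std::vector<double> clusterWork = nodeWeight;  // accumulated work per cluster root

    // Heaviest communication first: these are the edges we most want to zero.
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.cost > b.cost; });

    const double totalWork = std::accumulate(nodeWeight.begin(), nodeWeight.end(), 0.0);
    const double balanceBound = totalWork / cores;  // crude per-core work budget

    for (const Edge& e : edges) {
        int a = clusters.find(e.from);
        int b = clusters.find(e.to);
        if (a == b) continue;                                          // edge already zeroed
        if (clusterWork[a] + clusterWork[b] > balanceBound) continue;  // keep the clusters balanced
        clusters.unite(a, b);
        clusterWork[clusters.find(a)] = clusterWork[a] + clusterWork[b];
    }
    return clusters;
}

The resulting clusters could then be mapped onto cores much like the FMUs are today, with the difference that the mapping would be driven by the communication structure of the task graph rather than by restriction 3.1 alone.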


Bibliography

[1] Torsten Blochwitz, Martin Otter, Johan Akesson, Martin Arnold, Christoph Clauss, Hilding Elmqvist, Markus Friedrich, Andreas Junghanns, Jakob Mauss, Dietmar Neumerkel, et al. Functional mockup interface 2.0: The standard for tool independent exchange of simulation models. In Proceedings of the 9th International MODELICA Conference; September 3-5; 2012; Munich; Germany, number 076, pages 173–184. Linköping University Electronic Press, 2012.

[2] Functional mockup interface. http://fmi-standard.org/, 2018. Accessed: 2018-02-22.

[3] Ralf Kübler and Werner Schiehlen. Two methods of simulator coupling. Mathematical and computer modelling of dynamical systems, 6(2):93–113, 2000.

[4] Christian Andersson. Methods and tools for co-simulation of dynamic systems with the functional mock-up interface. Doctoral Theses in Mathematical Sciences, 2016.

[5] Jens Bastian, Christoph Clauß, Susann Wolf, and Peter Schneider. Master for co-simulation using FMI. In Proceedings of the 8th International Modelica Conference; March 20th-22nd; Technical University; Dresden; Germany, number 63, pages 115–120. Linköping University Electronic Press, 2011.

[6] Functional Mock-up Interface for Model Exchange and Co-Simulation. Modelica Association, July 2014. Version 2.0.

[7] Gene M Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pages 483–485. ACM, 1967.

[8] OpenMP Architecture Review Board. OpenMP application program interface version 4.5, 2015. URL https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf.

[9] Michael R Garey and David S Johnson. Computers and intractability, volume 29. W. H. Freeman, New York, 2002.

[10] Yu-Kwong Kwok and Ishfaq Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys (CSUR), 31(4):406–471, 1999.

[11] Oliver Sinnen. Task scheduling for parallel systems, volume 60. John Wiley & Sons, 2007.


[12] Jing-Jang Hwang, Yuan-Chieh Chow, Frank D Anger, and Chung-Yee Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM Journal on Computing, 18(2):244–257, 1989.

[13] M-Y Wu and Daniel D Gajski. Hypertool: A programming aid for message-passing systems. IEEE transactions on parallel and distributed systems, 1(3):330–343, 1990.

[14] Yu-Kwong Kwok and Ishfaq Ahmad. Benchmarking and comparison of the task graph scheduling algorithms. Journal of Parallel and Distributed Computing, 59(3):381–422, 1999.

[15] Abir Ben Khaled, Mongi Ben Gaid, Nicolas Pernet, and Daniel Simon. Fast multi-core co-simulation of cyber-physical systems: Application to internal combustion engines. Simulation Modelling Practice and Theory, 47:79–91, 2014.

[16] Thierry Grandpierre and Yves Sorel. From algorithm and architecture specifications to automatic generation of distributed real-time executives: a seamless flow of graphs transformations. In Formal Methods and Models for Co-Design, 2003. MEMOCODE'03. Proceedings. First ACM and IEEE International Conference on, pages 123–132. IEEE, 2003.

[17] Salah Eddine Saidi, Nicolas Pernet, Yves Sorel, and Abir Ben Khaled. Acceleration of FMU co-simulation on multi-core architectures. In The First Japanese Modelica Conferences, May 23-24, Tokyo, Japan, number 124, pages 106–112. Linköping University Electronic Press, 2016.

[18] Salah Eddine Saidi, Nicolas Pernet, and Yves Sorel. Automatic parallelization of multi-rate FMI-based co-simulation on multi-core. In Proceedings of the Symposium on Theory of Modeling & Simulation, page 5. Society for Computer Simulation International, 2017.

[19] Labinot Polisi. Initialization algorithms for coupled dynamic systems. Master’s Theses inMathematical Sciences, 2017.
