
TECHNISCHE UNIVERSITÄT MÜNCHEN
FAKULTÄT FÜR INFORMATIK

Software & Systems Engineering
Prof. Dr. Dr. h.c. Manfred Broy

SPES 2020 Deliverable 1.3.B-7

Concurrency Analysis and Transformation – An Overview –

Author: Wolfgang Schwitzer
Version: 1.3
Date: April 11, 2011
Status: Released

Technische Universität München - Fakultät für Informatik - Boltzmannstr. 3 - 85748 Garching


Version History

Version 0.1, Draft, 12.10.2010

Schwitzer: Initial structure and contents.

Version 0.2, Draft, 22.10.2010

Schwitzer: Included feedback from Prof. Broy.

Schwitzer: First version of the “Analysis” section.

Version 0.3, Draft, 14.12.2010

Schwitzer: Introduced “Software Engineering Questions concerning Multicores”.

Version 1.0, Draft, 05.01.2011

Schwitzer: First version for review by SPES partners.

Version 1.1, Draft, 06.01.2011

Schwitzer: Consistent use of term “overlapping schedule”.

Version 1.2, Draft, 12.01.2011

Schwitzer: Changed conclusions on “retiming”.

Version 1.3, Reviewed, 11.04.2011

Schwitzer: Included reviewer comments.


ABSTRACT. This document gives an overview of concurrency-related analysis and transformation techniques. These techniques can be applied to support software engineering for embedded systems with parallel processors. For several years now, there has been an ongoing paradigm shift from singlecore towards multicore processors. Employing multicore architectures for embedded systems brings in several advantages, but poses new challenges for software engineering. This document focuses on software engineering aspects related to programming embedded systems with parallel processors. First, related work and the scope of this document are discussed. Iterative data-flow is presented as the model of computation which is used throughout this paper. The main part of this document comprises analysis, metrics, and transformation of these data-flow models. A tool integration of these techniques is illustrated. The document concludes with a brief overview of ongoing research and future work.

Acknowledgements. The author thanks Manfred Broy, Martin Feilkas and Tobias Schüle for their helpful comments and fruitful discussions about conceptual and technical topics during the writing of this document. Additional thanks go to Sebastian Voss for introducing the author to his outstanding work on SMT-based scheduling methods. Florian Hölzl and Vlad Popa have the author's fullest respect for investing so much of their time and excellent programming skills in the tools AutoFocus and Cadmos that are used to evaluate and discuss many of the topics covered in this document. This work was mainly funded by the German Federal Ministry of Education and Research (BMBF), grant SPES2020, 01IS08045A.


Contents

1 Introduction
2 Related Work and Scope
  2.1 Iterative Data-Flow
  2.2 Scheduling
  2.3 Causality and Timing
  2.4 Composition
3 Concurrency Analysis
  3.1 Concurrency, Parallelism and Precedence
  3.2 Weakly Connected Components
  3.3 Strongly Connected Components and Cycles
  3.4 Iteration Bound
  3.5 Delay Profiles
4 Concurrency Metrics
  4.1 Software-concerned metrics
    4.1.1 Spatial concurrency
    4.1.2 Temporal concurrency
    4.1.3 Data-parallelism
  4.2 Hardware-concerned metrics
    4.2.1 Speed-up
    4.2.2 Efficiency
    4.2.3 Utilization of resources
  4.3 Deployment-concerned metrics
    4.3.1 Frequency
    4.3.2 Reactiveness
    4.3.3 Jitter-robustness
5 Concurrency Transformation
  5.1 Unfolding-Transformation
  5.2 Retiming-Transformation
  5.3 Look-Ahead-Transformation
6 Tool-Integration
7 Conclusion and Future Work
References


1 Introduction

Motivation. For several years now, there has been an ongoing paradigm shift in the silicon industry from singlecore towards multicore processors. Multicore processors are already in widespread use in desktop and server systems. Many next generation embedded systems are likely going to be built upon multicore processors, too. Employing multicore architectures for embedded systems brings in several advantages. For example, the overall number of embedded controllers per system can be reduced and parallel computing performance is increased. Still, moderate (electrical) power consumption can be maintained. However, the leap into the multicore era poses novel challenges for several disciplines, e.g. for electrical engineering, semiconductor production processes, education and software engineering [ABC+06].

This document focuses on the software engineering aspects related to programming embedded systems with parallel processors. Engineering software for distributed embedded systems is a complex and challenging task [Bro06b]. Software engineering efforts are going to become even more demanding with the introduction of platforms that offer highly parallel processing capabilities. In particular, it is important to be aware of an application's concurrency throughout the software development process. Concurrent parts of applications are deployed onto concrete multicore-based hardware and finally yield parallelism in application execution.

Awareness of an application's concurrency and adequate deployment with respect to platform-parallelism are key factors to leverage the potential of parallel architectures.

Software engineering questions concerning parallel embedded software systems. Some typical questions arise along the software engineering process for parallel architectures. In distributed systems, concurrency and parallelism are found on different levels of granularity [Bod95]. This document emphasizes software engineering questions that arise when coarse-grain software concurrency is deployed onto coarse-grain hardware parallelism. The technical architecture [TRS+10] of a distributed embedded system comprises coarse-grain parallel structures like controller-networks connected by gateways, embedded control units (ECU's) connected by field buses, CPU's and input/output-controllers connected by on-board buses, and processor-cores connected by on-chip interlink buses. The software architecture [Bro06a] of embedded systems is commonly described in terms of coarse-grain concurrent structures like independent or coupled applications, tasks and subtasks communicating by channels (e.g. pipes and shared variables). This software has to be deployed on the hardware. Hence, stakeholders (e.g. software- and hardware-engineers, managers) might ask the following questions when developing a distributed and highly parallel embedded software system:

Q1 What does an adequate parallel hardware platform for a given concurrent software architecture look like?

Q2 What does an adequate concurrent software architecture for a given parallel technical architecture look like?

Q3 Given a software architecture and a technical architecture, what does an adequate deployment of concurrent software components on parallel hardware components look like?


In these questions (Q1-Q3), the meaning of the term "adequate" strongly depends on the design goals for the overall system. Thus, in a highly reactive system, adequate can mean "with shortest possible response times". In a safety critical system, adequate can mean "with highest possible robustness against timing jitter on buses". In a system produced for high volume markets, adequate can mean "with lowest possible cost of hardware units".

Metrics for parallel embedded software systems. Unfortunately, for complex real world systems there usually does not exist such a distinct definition of the term "adequate". Rather, there are several competing design goals, which require trade-offs to be made. Hence, it is important to have a set of system metrics at hand that assist in answering questions Q1-Q3 in a comprehensive way. The following is a summary of metrics discussed in section 4 of this document. These metrics answer questions either about the software, or about the hardware, or about the deployment of the system:

Software-concerned metrics:

Spatial concurrency of applications

Temporal concurrency of applications

Data-parallelism achievable by stateless parts of applications

Hardware-concerned metrics:

Speed-up gained by investing in parallel processing power

Efficiency (average utilization of parallel processing capabilities)

Quantity and absolute utilization of resources (cores, buses, etc.)

Deployment-concerned metrics:

Frequencies of the system (software-side and hardware-side)

Response times (end-to-end delays from sensors to actuators)

Robustness against timing jitter on buses and distributed cores

Different levels of detail along the development process. Most of these metrics can be expressed on different levels of detail. In early phases of development, there is usually less detail available. In late phases of development, details about the software architecture, the software implementation and the hardware platform are at hand. Depending on how much information about the concrete software architecture and technical architecture is available, metrics can be expressed within the following levels of detail: uniform computation time, arbitrary integer computation times, arbitrary real computation times. These levels of detail have also been used by Xu and Parnas [XP93] for classifying scheduling algorithms.

In uniform computation time (also referred to as unit-time) each software task is considered to consume one unit of time for processing. In particular, this means that the time consumed for processing a task is independent from the implementation of the task. Hence, metrics in uniform computation time can already be retrieved in early stages of the development process. In early stages, a first software architecture (a decomposition into applications, tasks and channels) may be available, though the implementation can be incomplete. Additionally, a concrete hardware setup (number of processors, buses etc.) need not be known to perform analyses in uniform time. Rather, a parallel random access machine (PRAM [FW78]) is chosen as hardware platform, providing an unlimited number of processors and communication bandwidth. Communication latencies between processors are not considered in uniform time, that is, communication is considered to come at zero cost. In summary, uniform computation time analysis draws the picture of an ideally parallelizable system, which can appear significantly better than what is achievable with the actual software implementation and hardware platform.

At the next level of detail, uniform computation time is refined to arbitrary integer computation times. Here, first estimations of processing times and communication times do exist. Processing and communication times are interpreted relative to each other. For example, a task A takes one unit of time to process, a task B takes two units of time to process, and any communication over a channel from A to B takes three units of time. Again, a PRAM is chosen as hardware platform. In contrast to uniform time, here communication comes at the given relative cost. Metrics gained in arbitrary integer computation times give a refined picture of the concurrency and parallelism quality of the system under design. The more detail on implementation and potential hardware is used for estimation, the closer the metrics can reflect the properties of the final system.

Arbitrary real computation times analysis is the most detailed level. Here, the detailed software implementation and the hardware platform are known and, in consequence, the worst case execution time (WCET) for each task and communication operation is available. If a task can be scheduled on several different processors, the task's WCET on each of these processors is available. If a communication operation can be scheduled on several different buses, the communication operation's WCET on each of these buses is available. Metrics gained on the level of real computation times resemble the deployed system's concurrency and parallelism properties as closely as possible.

[Figure 1 (diagram): presentation of Matlab/Simulink, AutoFocus and C/C++ code models, analysis, transformation, metrics (Metric A, B, C for Options 1 to 3), simulation, code-generation and deployment.]

Figure 1: The contents of this document: presentation, analysis, metrics and transformation of multicore-based embedded software systems.

Goal and contents of this document. The goal of this document is to provide a (non-exhaustive) overview of concurrency analysis and transformation techniques that support the software engineering process for multicore-based embedded software systems. Basically all of the techniques discussed are well-known and have been thoroughly studied by others, so that this document merely gives an overview and outlines how these isolated methods can be combined.

Figure 1 illustrates the contents of this document, namely presentation, analysis, metrics and transformation of concurrent applications. First, applications are presented by appropriate models that already expose concurrency present in these applications. In the context of this document, data-flow programs (and their respective data-flow graphs) are employed for this purpose. Second, these data-flow programs are analyzed with graph analysis techniques to retrieve "raw properties" regarding concurrency. Third, these "raw properties" are put into relation with each other in order to gain metrics regarding concurrency or parallelism. Fourth, the models can be transformed to achieve improved concurrency and parallelism if the quality metrics do not satisfy the design goals of the system.

Outline. This document is structured as follows. Section 2 discusses related work and sets the scope for this document. Iterative data-flow is presented as the model of computation which is used throughout this paper, some constraints on scheduling are set, and the strong alignment between causality and data-flow is reviewed. Fundamental functional composition operators are introduced. The main part of this document comprises sections 3, 4, and 5. Analysis techniques on iterative data-flow models are presented, metrics are derived that are used for assessing the quality of concurrency in those models, and three well-studied transformation techniques are illustrated: namely unfolding, retiming and look-ahead. A tool integration of these techniques is illustrated in section 6. The document concludes with section 7, giving an outline of ongoing research and future work.


2 Related Work and Scope

This section discusses related work and sets the scope and constraints for the following sections of this document. First, iterative data-flow is presented as the model of computation, which is used throughout this paper to describe concurrent programs. Second, constraints and assumptions on scheduling techniques within this document are given. Third, the strong alignment between the concepts of causality and data-flow is reviewed and a translation from causal systems to data-flow systems is sketched. Finally, fundamental functional composition operators related to concurrent and sequential evaluation of functions are introduced.

2.1 Iterative Data-Flow

History, application and scope. This section briefly introduces iterative data-flow as a model to describe concurrent programs. The analyses and transformations presented throughout this document refer to iterative data-flow models. In a constructive approach, data-flow models represent a visual coordination language [Lee06] for implementing concurrent programs. In a more analytic approach, these models can be derived from other sources, e.g. models from MATLAB/Simulink [Mat], AutoFocus [Aut] or from C/C++-like source code [Sch09]. In general, data-flow models are well-known to expose application concurrency ([Rei68], [Den80], [DK82], [LM87b], [LM87a], [PM91]) and go back to the work of Karp and Miller [KM66] who showed determinacy of data-flow systems. A good introduction to iterative data-flow can be found in [PM91] and concepts of general data-flow modeling are illustrated well in [DK82].

Program 1:

for (t = 1 to ∞) {
    z(t) = 0.5 · (x(t) + y(t))
}

Figure 2: Iterative data-flow. (a) A simple nonterminating program, assigning the arithmetic mean of x(t) and y(t) to z(t) for each given t. (b) A data-flow graph that represents a nonterminating program with infinite input series x(t) and y(t), output series z(t), and two tasks. The dashed circles x, y and z are interface points to the environment, that is x and y are environment inputs and z is an environment output.


Embedded systems programming. Iterative data-flow offers several features that are desirable for embedded systems programming. Deadlock and boundedness can be statically analyzed. Capacities of FIFO queues on sender- and receiver-side are finite and can be statically derived as well. The execution order can be statically scheduled, i.e. at compile time, which minimizes runtime scheduling overhead or, as Lee and Messerschmitt put it, "most of the synchronization overhead evaporates" [LM87a]. Static scheduling can be important for safety critical and hard real-time embedded systems, where certification processes may require the use of static techniques.

Data-flow programs and data-flow graphs. Within the context of this document, the data-flow programs we refer to are iterative and nonterminating. This reflects the intention to use these programs for embedded software systems, which execute the same tasks repeatedly (iterative nature), and it is not known a priori when the system is going to terminate (nonterminating nature). Data-flow programs are visualized by data-flow graphs (DFG's). In data-flow graphs, vertices represent tasks and edges represent directed communication between tasks. According to Lee and Messerschmitt [LM87a], a data-flow program is called synchronous if the number of messages transmitted per unit of time is known at compile time. Furthermore, a synchronous data-flow (SDF) program is called homogeneous if the number of messages transmitted per unit of time is equal for all edges, or heterogeneous otherwise.

A special case of homogeneous SDF is called iterative data-flow, which is restricted to a homogeneous constant rate of "1" message transmitted per unit of time on each of the edges. An iterative data-flow program processes inputs and produces outputs by executing all tasks repeatedly. Inputs from and outputs of these programs are infinite time series. During a single iteration, each task executes exactly once, consuming one token from each of its input channels, and producing one token on each of its output channels.

An example of a simple iterative nonterminating program that operates over infinite time series is given in figure 2(a). Any execution of the for-loop (over t) in the nonterminating program corresponds to one iteration. The execution time of an iteration is called the iteration period of the program. In the example program 2(a), the iteration period is the time required to execute one addition operation plus one multiplication operation.

Figure 2(b) shows a data-flow graph that corresponds to this program, where one task represents the (2, 1)-ary addition operation and one task represents the (1, 1)-ary constant multiplication by a factor of 1/2. The infinite input and output series x(t), y(t) and z(t) are represented by edges of the graph. In the iterative data-flow model of computation, each task consumes a single token from each of its incoming series and produces a single token on each of its output series. Synonyms for "token" are "sample" and "message". The "iteration period" is also called "sampling period" with the reciprocal being the "iteration rate" or "sampling rate".
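A minimal sketch in Python may help to make this execution model concrete; the task names add and mul05 and the finite test series are illustrative assumptions only, since the real program of figure 2 operates on infinite series.

# Sketch: one function per DFG task; each call consumes one token per input channel
# and produces one token on its output channel.
def add(x_t, y_t):            # the (2,1)-ary addition task
    return x_t + y_t

def mul05(s_t):               # the (1,1)-ary multiplication by 0.5
    return 0.5 * s_t

def iterate(xs, ys):
    # One pass of the loop corresponds to one iteration of the data-flow program.
    return [mul05(add(x_t, y_t)) for x_t, y_t in zip(xs, ys)]

print(iterate([1, 2, 3], [3, 4, 5]))   # -> [2.0, 3.0, 4.0]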

Note that in general the concept of tasks in data-flow programs is not limited to basic arithmetic operations as in figure 2(b). In general, a task may describe functions of arbitrary complexity that can be nonlinear and time-varying, e.g., conditional constructs (if-then-else), state-automatons, or complete sub-tasks, specified by data-flow programs themselves. However, in the context of this document we treat the tasks of a DFG as "black-boxes". Here, these black-boxes perform (n,m)-ary functions and we do not discuss topics related to internal structuring or hierarchical composition of tasks as architectures.


Figure 3: Delays, pipelining and feedback in DFG's. Environment inputs and outputs are intentionally hidden to increase readability. (a) A DFG with tasks A, B, and C and two distinct delay operators D, forming a three-stage parallel pipeline. (b) A DFG with tasks A, B, and C located on a feedback cycle with one delay.

Delays, pipelining, feedback and parallelism. Many algorithms require tokens to be delayed, so that these tokens can be used in future iterations. For this purpose, data-flow graphs allow for specification of delay operators. A software implementation of a delay operator could be a variable or a FIFO queue. Delays are typically called "registers" or "latches" in electrical engineering literature. Figure 3(a) shows a DFG with three tasks A, B, and C and two distinct delay operators D. Inputs from and outputs to the environment are intentionally not shown for better readability.
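As a sketch of the software view mentioned above, a delay operator can be implemented as a FIFO that is pre-filled with k initial tokens; the Python class below and its initial token value are assumptions for illustration.

from collections import deque

class Delay:
    # k unit-delays: a token written now is read again k iterations later.
    def __init__(self, k=1, initial=0):
        self.queue = deque([initial] * k)

    def step(self, token_in):
        self.queue.append(token_in)    # keep the token for a future iteration
        return self.queue.popleft()    # emit the token produced k iterations ago

d = Delay(k=1)
print([d.step(t) for t in [10, 20, 30]])   # -> [0, 10, 20]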

The system in figure 3(a) can be executed in three logical units of time if tasks are scheduled sequentially as A;B;C. Delays in a data-flow system can be leveraged to reduce the total execution timespan, since they introduce temporal concurrency and allow for pipelining parallelism on a parallel platform. A good introduction to pipelining in microprocessors can be found in [LS11], p. 190 and following pages. Scheduling the tasks A, B, and C for execution as parallel pipeline stages, each on an individual processor, can reduce the timespan required to execute all three tasks. An example is given in section 4.1.2. In the context of iterative data-flow, the timespan required to execute all of the tasks once is called the iteration period. Delays and pipelining can be leveraged to increase the iteration rate of data-flow programs in the same way pipelining is used in VLSI systems to increase the frequency of microprocessors, for example.

As explained in more detail in sections 3.3 and 3.4, feedback cycles can pose a lower limit on the iteration period, called the iteration bound. The iteration bound can severely limit the achievable parallelism. Figure 3(b) shows a DFG with three tasks A, B, and C located on a feedback cycle that has one delay operator.

2.2 Scheduling

Within this document, scheduling is concerned with the construction of non-preemptive, static schedules. Preemption is not allowed, that is, once a task is started, it runs to completion and cannot be interrupted by other tasks. Schedules are constructed statically at compile time; to be more precise, the schedules are fully-static. Cyclo-static scheduling is not covered in the context of this document. Non-overlapping as well as overlapping schedules are presented, the former being associated with a single iteration, the latter being associated with multiple subsequent iterations in order to exploit inter-iteration parallelism.

Schedules are concerned with the timing of schedulable units. Schedulable units are tasks and communication operations among those tasks. Schedules can be given in logical units of time or in physical time. In logical time, each schedulable unit takes n ≥ 1 units of time for execution; in physical time, each schedulable unit has a deterministic worst-case execution time (WCET), known at compile time.

2.3 Causality and Timing

Closely related to the concepts of iterative data-flow (see Sec. 2.1) are the concepts of causality and timing. A straightforward translation scheme from causal models to iterative data-flow models is sketched here. A more detailed discussion on this topic is out of the scope of this document. Once a causal model is translated, the techniques presented in this document can be applied to it, too. Hence, it is possible to interpret analysis and transformation of concurrency from a causality-centered point of view. For example, the CASE-Tool AutoFocus [Aut], presented later in section 6, supports modeling of embedded systems software with causal components and a notion of global logical time. AutoFocus is based on a theory of stream-processing functions [BS01] that divides logical time into discrete intervals, so-called ticks. Here, components roughly correspond to tasks, and causality to delay operators. A global tick corresponds to one iteration of a data-flow program.

Figure 4: Translation schemata from weakly and strongly causal components to iterative data-flow graphs. (a) Translation of a weakly causal component Fw. (b) Translation of a strongly causal component Fs.

Figure 4 sketches translation schemata for weakly and strongly causal components. Figure 4(a)illustrates how a weakly causal component with behavior function Fw, syntactic input interfaceI = {i1, . . . , in}, and syntactic output interface O = {o1, . . . , om} translates to a task that im-plements function Fw and has input series i1(t), . . . , in(t), and output series o1(t), . . . , om(t).


Figure 4(b) shows how a strongly causal component with behavior function Fs, syntactic input interface I = {i1, . . . , in}, and syntactic output interface O = {o1, . . . , om} translates to a task that implements function Fs and has input series i1(t), . . . , in(t), and output series o1(t), . . . , om(t). Each of the output series has a dedicated single unit-delay operator D. The output unit-delay operators satisfy the strongly causal nature of Fs: the outputs of Fs at the next point of time t + 1 depend causally on the inputs to Fs at the current point of time t.

In general, a k-causal component, k ≥ 0, demands the outputs at the k-next point of time t + k to depend causally on the inputs at the current point of time t. Consequently, a k-causal component requires k unit-delays (kD) on each output.

2.4 Composition

Functions are composed by three basic composition operators: sequential, concurrent and recurrent composition, as illustrated in figure 5. Functional composition is fundamental for the techniques presented in the following sections. A more detailed discussion on this topic is found in [BDD+92] and [Bro95], for example. The following paragraphs introduce the composition operators, where a task with n inputs and m outputs is called an (n,m)-ary function.

Figure 5: Forms of composition: (a) sequential (f ◦ g), (b) concurrent (f ‖ g) and (c) recurrent (µf).

Sequential composition. The two functions f and g have to be evaluated in sequence, since the outputs of f are the inputs of g. Let f be an (n,m)-ary function and let g be an (m, o)-ary function. Then f ◦ g is the (n, o)-ary function defined by

(f ◦ g)(x1, . . . , xn) = g(f(x1, . . . , xn)) .

Concurrent composition. The two functions f and g can be evaluated independently from each other. Let f be an (n,m)-ary function and let g be an (o, p)-ary function. Then f ‖ g is the (n + o, m + p)-ary function defined by

(f‖g)(x1, . . . , xn+o) = (f(x1, . . . , xn), g(xn+1, . . . , xn+o)) .


Recurrent composition. The function f feeds one of its inputs from one of its outputs, thus f is defined recursively. Let f be an (n,m)-ary function where n > 0. Then µf is the (n − 1,m)-ary function such that the value of (µf)(x1, . . . , xn−1) is the (least) fixed point of the equation

(y1, . . . , ym) = f(x1, . . . , xn−1, ym) .

The µ-operator used above feeds back the m-th output channel of an (n,m)-ary function with n > 0.

Note that using recurrent composition introduces the difficult problem of solving fixed point equations. Moreover, for a general recurrent function a fixed point may not even exist, thus making this function non-computable. It has been shown that delayed recurrent structures are guaranteed to have a fixed point (see [BS01] for an overview). In the context of this document, we define computable recurrent structures to have at least one unit-delay on each recurrent path.
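A minimal sketch of the three composition operators on tuple-valued Python functions; the helper names seq, par and mu are illustrative assumptions, and µf is realized operationally with a unit-delay on the feedback path rather than by solving a fixed point equation, in line with the computability restriction above.

# Sketch: (n,m)-ary functions are modeled as Python functions on tuples.
def seq(f, g):
    # Sequential composition f ◦ g: the outputs of f are the inputs of g.
    return lambda *xs: g(*f(*xs))

def par(f, g, n):
    # Concurrent composition f ‖ g: f consumes the first n inputs, g the rest.
    return lambda *xs: f(*xs[:n]) + g(*xs[n:])

def mu(f, initial=0):
    # Recurrent composition µf with a unit-delay on the feedback path:
    # the m-th output of the previous iteration is fed back as the last input.
    state = [initial]
    def stepped(*xs):
        ys = f(*xs, state[0])
        state[0] = ys[-1]
        return ys
    return stepped

# Example: an accumulator built from a (2,1)-ary adder with delayed feedback.
adder = lambda x, feedback: (x + feedback,)
accumulate = mu(adder)
print([accumulate(x)[0] for x in [1, 2, 3]])   # -> [1, 3, 6]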


3 Concurrency Analysis

This section presents a number of concurrency analysis techniques that can be performed on iterative data-flow graphs (DFG's, see Sec. 2.1). More formally, an iterative data-flow graph G is defined as a tuple G = {V,E}, with V being a set of vertices and E being a set of directed edges E = {(v1, v2) : v1, v2 ∈ V }. Here, the vertices V represent tasks and the edges E represent directed communication channels. Since multiple edges can exist between two vertices, G belongs to the class of directed multigraphs. The following paragraphs define some useful terms for analyzing concurrency in data-flow graphs.

Definition 3.1. The precedence relation . ≺ . ⊆ V × V defines whether a constraint on the order of execution of two vertices exists. If v1 ≺ v2 (read "v1 precedes v2") holds for two vertices v1, v2 ∈ V , then v1 has to be scheduled before v2.

Definition 3.2. The concurrency relation . ‖ . ⊆ V × V defines whether there does not exist a constraint on the order of execution of two vertices. It complements the precedence relation. If v1 ‖ v2 (read "v1 is concurrent to v2") holds for two vertices v1, v2 ∈ V , then both v1 and v2 can be arbitrarily scheduled with respect to each other. Precedence cannot be claimed in either direction, that is ¬(v1 ≺ v2) ∧ ¬(v2 ≺ v1).

Definition 3.3. The execution time function τ : V → R returns the execution time associated with a vertex v ∈ V . Note that execution times in unit-time are always defined as τ(v) = 1.

Definition 3.4. The delay function δ : E → N returns the number of unit-delays associated with an edge e ∈ E.
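These definitions translate into a small graph representation; the following Python sketch (class and field names are assumptions, not taken from the Cadmos tooling) is reused by the analysis sketches later in this section.

# Sketch: an iterative data-flow graph with execution times tau and edge delays delta.
class DFG:
    def __init__(self):
        self.tau = {}      # tau[v]: execution time of vertex v (1 in unit-time)
        self.edges = []    # list of (v1, v2, delta) with delta = number of unit-delays

    def add_task(self, v, tau=1):
        self.tau[v] = tau

    def add_channel(self, v1, v2, delta=0):
        self.edges.append((v1, v2, delta))

# Roughly the DFG of figure 3(b): A, B, C on a feedback cycle with one delay.
g = DFG()
for v in "ABC":
    g.add_task(v)
g.add_channel("A", "B")            # zero-delay edge: constrains the current iteration
g.add_channel("B", "C")
g.add_channel("C", "A", delta=1)   # delayed edge: constrains a future iteration only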

3.1 Concurrency, Parallelism and Precedence

Concurrency and parallelism. Concurrency occurs where no precedence can be claimed (see definitions 3.1 and 3.2). Two concurrent tasks v1 and v2 can be arbitrarily scheduled with respect to each other: task v1 can be scheduled any time after task v2 and vice versa. If v1 and v2 are scheduled on different processors at an overlapping period of time, the concurrency of v1 and v2 has been leveraged to produce parallelism (in time). In a sense, concurrency reflects the absence of observable causal dependency, while parallelism reflects the coincident execution at runtime. Hence, concurrency of tasks is a prerequisite for parallelization on the runtime platform.

Intra-iteration precedence. Intra-iteration precedence is concerned with the precedence constraints that exist within a single iteration of a data-flow program. The first step in intra-iteration precedence analysis of a given data-flow graph G is to construct its acyclic precedence graph

APG(G) = G / ED

by removing the set of edges with at least one unit-delay,

ED = {e ∈ E : δ(e) ≥ 1} ,

from the original graph G.


Figure 6: Precedence in iterative data-flow. (a) A simple iterative DFG with 4 tasks A, B, C and D. (b) Corresponding intra-iteration precedence graph. (c) Corresponding intra- and inter-iteration precedence graph for 2 consecutive iterations.

Only edges with zero delay affect computations in the current iteration. Edges with k unit-delays affect computations which are k iterations in the future; hence, delayed edges do not affect intra-iteration precedence. Note that APG(G) is guaranteed to be acyclic: by definition, every cycle in the original graph G must have at least one edge with a unit-delay greater than zero to be computable (see section 2.4). Thus, each cycle C is guaranteed to be "broken up" in APG(G) by removing at least that one edge e ∈ C with δ(e) ≥ 1.

The second step is to include tuples in the precedence relation "≺" (see definition 3.1). For each vertex v ∈ V get the set of transitively reachable successors S. For each s ∈ S, add "v ≺ s" to the precedence relation. In other words, each vertex precedes its transitive successors in the acyclic precedence graph.

Example 3.1.1 (Intra-iteration precedence). Figure 6(b) shows an acyclic precedence graph of the DFG in figure 6(a). In this example, the precedence relation consists of {(A ≺ B), (A ≺ D), (B ≺ D), (C ≺ D)}. Furthermore, the concurrency relation consists of {(A ‖ C), (B ‖ C)}. Within one iteration, C can be scheduled independently from A and B as long as A is scheduled before B, B before D and C before D. Here, two processors are sufficient to fully parallelize the data-flow program.
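Both steps can be sketched directly on the DFG structure introduced at the beginning of this section; the function names apg, precedence and concurrent below are assumptions for illustration.

def apg(g):
    # Step 1: keep only zero-delay edges; delayed edges do not constrain this iteration.
    return [(v1, v2) for (v1, v2, delta) in g.edges if delta == 0]

def precedence(g):
    # Step 2: every vertex precedes its transitive successors in APG(G).
    succ = {v: [] for v in g.tau}
    for (v1, v2) in apg(g):
        succ[v1].append(v2)
    prec = set()
    for v in g.tau:
        stack = list(succ[v])
        while stack:
            s = stack.pop()
            if (v, s) not in prec:
                prec.add((v, s))           # v precedes s
                stack.extend(succ[s])
    return prec

def concurrent(g):
    # v is concurrent to u where precedence cannot be claimed in either direction.
    prec = precedence(g)
    return {(v, u) for v in g.tau for u in g.tau
            if v != u and (v, u) not in prec and (u, v) not in prec}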

Inter-iteration precedence. Inter-iteration precedence describes the precedence constraints that exist between consecutive iterations of a data-flow program. The following explains how inter-iteration precedence between two tasks A and B is analyzed. As a first step, let Ai and Bi be the execution of task A and task B in the i-th iteration of the data-flow program. Further, let k be the smallest sum of delays on any path that leads from A to B in the original DFG. More formally, k is defined by

k = min { DP }

where the minimum is taken over all paths P that lead from A to B, and DP is the sum of delays on path P,

DP = Σ_{e ∈ (P ∩ E)} δ(e) .


The second step is to include "Ai ≺ Bi+k" in the inter-iteration precedence relation. Note that in the case of self-loops or cycles (see section 3.3), a task can in particular show inter-iteration precedence with itself.
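Since all delay counts are non-negative, the minimum delay sum k from a task a to a task b can be computed with a Dijkstra-style search over δ(e); a hedged sketch on the DFG structure from the beginning of this section (paths of at least one edge are considered, so the self-precedence case a == b is covered as well):

import heapq

def min_delay(g, a, b):
    # Smallest sum of delays over all paths from a to b, or infinity if no path exists.
    dist = {v: float("inf") for v in g.tau}
    heap = []
    for (v1, v2, delta) in g.edges:          # seed with a's outgoing edges
        if v1 == a and delta < dist[v2]:
            dist[v2] = delta
            heapq.heappush(heap, (delta, v2))
    while heap:
        d, v = heapq.heappop(heap)
        if v == b:
            return d                          # k: "a_i precedes b_(i+k)"
        if d > dist[v]:
            continue
        for (v1, v2, delta) in g.edges:
            if v1 == v and d + delta < dist[v2]:
                dist[v2] = d + delta
                heapq.heappush(heap, (d + delta, v2))
    return float("inf")                       # no path: no inter-iteration constraint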

Example 3.1.2 (Inter-iteration precedence). Figure 6(c) shows an acyclic precedence graph of the DFG in figure 6(a) for two consecutive iterations. Ai, Bi, Ci and Di represent the execution of the tasks A, B, C and D in the i-th iteration. Intra- and inter-iteration precedence is shown. In fact, "A1 ≺ C2" is the single inter-iteration precedence constraint that links the subgraphs of both iterations. This precedence graph for two consecutive iterations offers more concurrency than the intra-iteration precedence graph for a single iteration (figure 6(b)) does. In a combined schedule for both iterations, four processors can be used to fully parallelize the data-flow program. Three processors are sufficient if the execution time of C is less than or equal to that of A, that is τ(C) ≤ τ(A).

3.2 Weakly Connected Components

Weakly connected components (WCC's) in data-flow graphs represent pairwise concurrent processes in an intuitive way. In general, any directed graph G = {V,E} can be uniquely decomposed into n weakly connected components W1, . . . , Wn. The WCC's are disjoint subgraphs of G, that is G = W1 ∪ · · · ∪ Wn and Wi ∩ Wj = ∅ for all i, j ∈ {1, . . . , n} where i ≠ j. Furthermore, the WCC's are pairwise disconnected. Here, disconnected means that there does not exist any edge (v, u) leading from any vertex v ∈ Wi to any vertex u ∈ Wj for all i, j ∈ {1, . . . , n} where i ≠ j. The vertices v and u represent tasks. Since the tasks v and u reside in distinct WCC's that are not connected, there does not exist communication via channels among these tasks. Hence, no precedence between these tasks can be claimed, that is ¬(v ≺ u) ∧ ¬(u ≺ v) ⇔ v ‖ u.

Figure 7: Weakly connected components analysis of a DFG with 20 tasks in total. The 6 distinct components are highlighted by colored areas. The largest component comprises 9 tasks, 3 components comprise 3 tasks each, and 2 components consist of a single task each.

Now we extend this idea from single tasks v and u to complete weak component clusters Wi and Wj where i ≠ j. It can be concluded that Wi and Wj are pairwise concurrent and can be scheduled in parallel, that is Wi ‖ Wj. Note that precedence between the tasks inside a single weakly connected component still exists. The class of concurrency introduced by weakly connected components is also called spatial concurrency. This emphasizes the topological aspect of this topic. Thus, a DFG with n weakly connected components offers at least n-fold spatial concurrency.

Example 3.2.1 (Analysis of weakly connected components). Figure 7 illustrates a weakly connected components analysis of a DFG with a total of 20 tasks. In this example, 6 distinct WCC's can be identified, which are highlighted by colored areas. The largest component comprises 9 tasks, there are 3 components that comprise 3 tasks each, and there are 2 trivial components that consist of a single task each. Hence, by scheduling each of the 6 WCC's on a dedicated processor, 6-fold parallelism can be achieved in this example. On the one hand, no costly inter-processor communication (IPC) is required by this schedule. On the other hand, speed-up and efficiency of this schedule heavily depend on the actual execution times of the 20 tasks.
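A hedged sketch of the WCC decomposition on the DFG structure from the beginning of this section: edge directions and delays are ignored and the components are collected by a simple traversal.

def wcc(g):
    # Weakly connected components W_1, ..., W_n of a DFG.
    neighbours = {v: set() for v in g.tau}
    for (v1, v2, _) in g.edges:              # ignore direction and delay
        neighbours[v1].add(v2)
        neighbours[v2].add(v1)
    components, seen = [], set()
    for v in g.tau:
        if v in seen:
            continue
        component, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                component.add(u)
                stack.extend(neighbours[u] - seen)
        components.append(component)         # pairwise concurrent clusters
    return components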

3.3 Strongly Connected Components and Cycles

Figure 8: Strongly connected components and cycles analysis of a DFG with 10 tasks in total. There exists 1 strongly connected component, which is shown with highlighted edges. This strongly connected component comprises 6 tasks, belonging to 3 minimal cycles. The 3 minimal cycles are highlighted by colored areas.


Detecting feedback cycles is important in concurrency analysis, since cycles hinder parallelization in three ways:

Cycles reduce data parallelism (see Sec. 4.1.3).

Cycles affect the iteration bound (see Sec. 3.4).

Cycles restrict transformations in modifying the number of delay operators (see Sec. 5.2).

Feedback occurs in data-flow graphs inside so-called feedback cycles. On a cycle C = {v1, . . . , vn} with length l = |C| each vertex v ∈ C can transitively reach any other vertex u ∈ C in a maximum of l steps. This definition of cycles is closely related to the definition of strongly connected components (SCC's) in directed graphs. If a path from a vertex v to a vertex u exists in an SCC, this implies that there also exists a path from u to v. Hence, the vertices v and u and, furthermore, all vertices on the paths between v and u are strongly connected. SCC's in directed graphs can be efficiently detected, e.g. by using Tarjan's algorithm [Tar72] with runtime complexity O(|V| + |E|).

Once data tokens enter a feedback cycle, these tokens (or the effects caused by them) can possibly circulate inside this cycle for an infinite number of iterations. Hence, feedback cycles introduce so-called states in a system. Any task inside a feedback cycle becomes stateful, even if the separate task (viewed in isolation) is stateless. Stateful tasks do not lend themselves to coarse-grain data parallelization [GTA06]. Furthermore, the iteration bound is likely to rise if feedback cycles with a low number of unit-delays are present (see Sec. 3.4). In feedback systems, modifying delays changes the functional behavior or even destroys causality.

Example 3.3.1 (Analysis of strongly connected components and cycles). Figure 8 shows the strongly connected components and cycles of a DFG with 10 tasks in total. There exists 1 strongly connected component, which is shown with highlighted edges. This strongly connected component comprises 6 tasks, belonging to 3 minimal cycles. These 3 minimal cycles are highlighted by colored areas.
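The text refers to Tarjan's algorithm; as a hedged alternative with the same O(|V| + |E|) complexity, the sketch below uses the two-pass Kosaraju scheme on the DFG structure from the beginning of this section.

def scc(g):
    # Strongly connected components via Kosaraju's two-pass scheme.
    fwd = {v: [] for v in g.tau}
    rev = {v: [] for v in g.tau}
    for (v1, v2, _) in g.edges:
        fwd[v1].append(v2)
        rev[v2].append(v1)

    order, seen = [], set()
    def dfs1(v):
        seen.add(v)
        for w in fwd[v]:
            if w not in seen:
                dfs1(w)
        order.append(v)                      # post-order of the first pass

    for v in g.tau:
        if v not in seen:
            dfs1(v)

    components, assigned = [], set()
    for v in reversed(order):                # second pass on the reversed graph
        if v in assigned:
            continue
        component, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in assigned:
                assigned.add(u)
                component.add(u)
                stack.extend(w for w in rev[u] if w not in assigned)
        components.append(component)
    return components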

3.4 Iteration Bound

The time it takes to complete the execution of one iteration of a data-flow program is referred to as the iteration period. Can we expect that adding more processors will always lead to a shortened iteration period due to increased parallelism? In theory, it is possible to reduce the iteration period of any feed-forward system towards zero by adding more processors (see [Par89a], for example). Unfortunately, any data-flow program with feedback cycles has an inherent iteration bound, which is a lower bound on the achievable iteration period. If the iteration period of a system equals its iteration bound, the system is called rate-optimal. It is not possible to construct a schedule with an iteration period lower than the iteration bound, regardless of the number of parallel processors available. The notion of a lower bound on the iteration period in feedback systems was discovered in the late 1960's by Reiter [Rei68] as the maximum cycle ratio π with maximum computation rate 1/π for periodic admissible schedules.

Parhi showed that rate-optimal schedules for static data-flow programs can always be constructed [PM91]. This is important for two reasons: on the one hand, the maximum achievable parallelism in a feedback system can be determined statically at compile time; on the other hand, more platform parallelism could be exploited as long as the program is not yet rate-optimal. Nevertheless, even if a vast number of processors is available, an iteration period less than the iteration bound cannot be achieved. Note that in some cases, selective rewriting of program behavior (see section 5.3) may allow for lowering iteration periods below the original iteration bound.

In any data-flow program with feedback cycles the iteration bound is given by

T∞ = max { TC / DC }

where the maximum is taken over all cycles C ⊆ G in the data-flow graph, and TC is the sum of execution times of vertices in cycle C,

TC = Σ_{v ∈ (C ∩ V)} τ(v) ,

and DC is the sum of unit-delays on edges in cycle C,

DC = Σ_{e ∈ (C ∩ E)} δ(e) .

Note that DC > 0 always holds in any computable cycle C, which must have at least one unit-delay by definition. Any cycle C for which TC/DC = T∞ is referred to as a critical cycle of the data-flow program.
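For the coarse-grain DFG's considered here, T∞ can be determined by enumerating simple cycles directly; a hedged, unoptimized sketch on the DFG structure from the beginning of this section, assuming every cycle carries at least one unit-delay (computability):

def iteration_bound(g):
    # T_inf = max over all cycles C of T_C / D_C (0.0 for a feed-forward DFG).
    succ = {v: [] for v in g.tau}
    for (v1, v2, delta) in g.edges:
        succ[v1].append((v2, delta))

    best = 0.0

    def walk(start, v, t_sum, d_sum, visited):
        nonlocal best
        for (w, delta) in succ[v]:
            if w == start:                               # a simple cycle is closed
                best = max(best, t_sum / (d_sum + delta))
            elif w not in visited:
                walk(start, w, t_sum + g.tau[w], d_sum + delta, visited | {w})

    for v in g.tau:
        walk(v, v, g.tau[v], 0, {v})
    return best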

Figure 9: A simple DFG and a corresponding rate-optimal schedule. (a) DFG of a data-flow program with three tasks: τ(A) = 1, τ(B) = 5 and τ(C) = 2. (b) A corresponding rate-optimal schedule of three successive iterations on four processors P1, . . . , P4 with T∞ = 2.

Example 3.4.1. The example shown in figure 9 illustrates the idea of iteration boundedness. The iteration bound of a simple DFG with three vertices A, B and C and feedback cycles is calculated, and a rate-optimal combined schedule of three successive iterations of this DFG on four processors is given. The example in 9(a) shows a simple DFG of a data-flow program with three tasks. Individual execution times of the tasks are given as τ(A) = 1, τ(B) = 5 and τ(C) = 2. Two cycles exist in DFG 9(a): Cycle1 between A and B and a self-loop Cycle2 on C. The iteration bound is derived as

T∞ = max { (τ(A) + τ(B)) / (δ((A,B)) + δ((B,A))) , τ(C) / δ((C,C)) } = max { (1 + 5) / (2 + 1) , 2/1 } = 2

units of time. Figure 9(b) shows a four-processor schedule of three successive iterations. The four processors P1, . . . , P4 are arranged on the vertical axis and time is displayed on the horizontal axis. A1, A2 and A3 refer to the execution of task A in the 1st, 2nd and 3rd iteration. The same applies to Bi and Ci, i ∈ {1, 2, 3}. The total period of the schedule is 6 units of time and the iteration period is 6/3 = 2 units of time, since 3 iterations are executed by this schedule within 6 units of time. Hence, the schedule given in figure 9(b) is already rate-optimal as its iteration period is equal to its iteration bound. Neither by adding more successive iterations, nor by adding processors can a schedule with a shorter iteration period than 2 units of time be constructed.

3.5 Delay Profiles

Embedded software systems commonly have the property of being reactive systems. Reactiveness means that the system steadily communicates and interacts with its environment: it "reacts" on input data from the environment by producing output data for the environment within a given period of time. In embedded systems, input data are read from physical sensors and output data are written to physical actuators.

Figure 10: Delay profile with guaranteed delays from inputs to outputs. (a) DFG with 2 inputs and 3 outputs. (b) Corresponding delay profile, showing for each input its reachable outputs and respective guaranteed delay.

Each environment input of a given data-flow program can affect a set of environment outputs. This set of affected outputs Y ⊂ V is determined by transitive search beginning at the input x ∈ V . For example, this can be achieved by depth-first-search (DFS) within the data-flow graph. Now, we know that any output to the environment y ∈ Y can depend (causally) on the input from the environment x. Furthermore, we know that at least one path P ⊆ G from x to y exists. Consequently, we can derive a profile of guaranteed delays between stimuli on x and reactions on any of the y ∈ Y . According to the delay calculus of Broy [Bro10], the guaranteed delay between two vertices x and y is given by

δgar(x, y) = min { DP }

where the minimum is taken over all paths P ⊆ G from x to y, and DP is the sum of unit-delays on edges along the path P,

DP = Σ_{e ∈ (P ∩ E)} δ(e) ,

if such a path P exists, or δgar(x, y) = ∞ otherwise. In other words, δgar(x, y) = d means that it takes at least d iterations before stimuli on sensor x can lead to observable effects on actuator y. In particular, δgar(x, y) = ∞ means that stimuli on sensor x never lead to observable effects on actuator y.
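Computing δgar(x, y) is again a shortest-path problem over the delay weights δ(e), so the min_delay sketch from section 3.1 can be reused; the function name delay_profile is an assumption for illustration.

def delay_profile(g, inputs, outputs):
    # Guaranteed delay (in iterations) from every environment input to every output.
    return {(x, y): min_delay(g, x, y)     # float('inf') if y is unreachable from x
            for x in inputs for y in outputs}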

Example 3.5.1 (Delay profiles). Figure 10(b) shows a delay profile of the simple DFG in figure 10(a), created by the toolkit Cadmos (see section 6). Note that the term "ticks" is synonymous with "iterations" (see also section 2.3). Input A affects output C after ≥ 4 and output D after ≥ 1 iterations. Input E affects output F after ≥ 2 iterations. Percentages show the distribution of input-/output-delays, relative to the highest delay (≥ 4 in this case).

Why are delay profiles interesting for concurrency and parallelization? Delay profiles can be used to increase parallelism in multi-iteration schedules and can be applied to determine jitter-stability of distributed systems. The following two paragraphs summarize how this can be achieved.

Increasing parallelism in multi-iteration schedules. Guaranteed delays pose a lower bound on the observable input-/output-latencies of the system under design. In a "black-box" view, we observe that the system never reacts on input before some guaranteed amount of time has passed. Internally, the actual execution of tasks can be delayed for several iterations, as long as the guaranteed delays are satisfied. This offers additional freedom in scheduling for parallel systems. In multi-iteration schedules, there can be "overpopulated" points in time with more concurrent tasks than processors, and there can be "sparse" points in time with fewer concurrent tasks than processors. Using delay profiles, tasks can be moved from overpopulated to sparse points in the schedule in order to increase parallelism.

Jitter-stability in distributed systems. Jitter in time is a severe problem in distributed real-time systems. The deviation in transmission time of messages over buses is an example of jitter. Usually, this jitter is some ε that is specific to a given bus system. In this case, jitter can be taken into account in advance by using transmission times t ± ε instead of t. More difficult to handle is "sporadic" jitter that may delay messages in the order of several iterations. Sporadic jitter can be caused by electromagnetic interference on the bus system's physical layer, for example. The idea is to use delay profiles to determine how many iterations a message can be delayed by jitter without affecting the guaranteed input-/output latencies of the system. Employing delays for jitter-stability is ongoing research and needs to be discussed in more detail in companion documents.


4 Concurrency Metrics

In the preceding sections 2 and 3, we have shown how applications for embedded systems are presented as iterative data-flow programs and introduced some basic analysis techniques. This section gives an overview of some useful concurrency metrics that can be derived from iterative data-flow models with the help of those analysis techniques. Along the software engineering process, these metrics support answering concurrency- and parallelism-related questions concerning mainly the software, hardware and deployment of a system. Typical questions for a system under design have been outlined in the introduction section 1: what does an adequate concurrent software architecture, adequate parallel hardware platform, or adequate distributed deployment look like?

Discussion of metrics is arranged in this section as follows. Three different main areas of metrics are discussed: software-, hardware- and deployment-concerned. Each area is presented in one of the following subsections. Each single metric is explained in a dedicated sub-subsection and organized by three topics: purpose, analysis and calculation of the respective metric.

4.1 Software-concerned metrics

This subsection explains three mainly software-concerned metrics: available spatial concurrency, available temporal concurrency and available data-parallelism.

4.1.1 Spatial concurrency.

Purpose. Available spatial concurrency reflects the pairwise non-communicating parts of an application. These parts neither intercommunicate within a single iteration nor across subsequent iterations. A program with n spatially concurrent parts offers at least n-fold parallelism. The term spatial refers to the fact that parts are always concurrent to each other, regardless of time (or iterations). Each of the n spatially concurrent parts can be scheduled on a dedicated processor, producing n-fold parallelism in total. On the one hand, no costly inter-processor communication (IPC) is required by exploiting spatial concurrency. On the other hand, speed-up and efficiency heavily depend on the actual execution times of the tasks inside the concurrent parts.

Analysis. Weakly connected components analysis from section 3.2 is employed. The analysis function WCC(G) returns the n disjoint WCC's of a data-flow graph G.

Calculation. The spatial concurrency Cσ ∈ N of a data-flow program with corresponding data-flow graph G is defined by

Cσ(G) := ‖WCC(G)‖ .


4.1.2 Temporal concurrency.

Purpose. Available temporal concurrency reflects parts of an application that only communicate across iterations, but never within a single iteration. A program with n temporally concurrent parts offers at least n-fold parallelism. This kind of parallelism is also referred to as pipelining parallelism. Each of the n temporally concurrent parts can be scheduled as a pipeline stage on a dedicated processor, producing n-fold parallelism in total. On the one hand, the iteration period can be significantly reduced by pipelining. On the other hand, costly inter-processor communication (IPC) is required by exploiting temporal concurrency. The different pipeline stages reside on different processors and the stages have to communicate from one iteration to the next.

Analysis. Acyclic precedence graph analysis from section 3.1 and weakly connected components analysis from section 3.2 are employed. Additionally, the spatial concurrency Cσ (see section 4.1.1, above) is required. The analysis function APG(G) returns the acyclic precedence graph of a data-flow graph G. The analysis function WCC(G) returns the n disjoint WCC's of a data-flow graph G.

Calculation. The temporal concurrency Cτ ∈ N of a data-flow program with corresponding data-flow graph G is defined by

Cτ (G) := ‖WCC(APG(G))‖ − Cσ(G) + 1 .
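
A minimal sketch of this calculation. It assumes that the acyclic precedence graph of a single iteration keeps exactly the edges without unit-delays (delayed edges only constrain later iterations); this APG construction and the edge attribute name "delay" are assumptions of the sketch, not a statement about the Cadmos implementation:

import networkx as nx

def apg(dfg: nx.DiGraph) -> nx.DiGraph:
    # Assumption: the acyclic precedence graph of one iteration keeps only the
    # edges without unit-delays; delayed edges constrain later iterations only.
    g = nx.DiGraph()
    g.add_nodes_from(dfg.nodes)
    g.add_edges_from((u, v) for u, v, d in dfg.edges(data="delay", default=0) if d == 0)
    return g

def temporal_concurrency(dfg: nx.DiGraph) -> int:
    # C_tau(G) := |WCC(APG(G))| - C_sigma(G) + 1
    c_sigma = nx.number_weakly_connected_components(dfg)
    return nx.number_weakly_connected_components(apg(dfg)) - c_sigma + 1

# Illustrative two-stage pipeline coupled only through a unit-delay edge.
dfg = nx.DiGraph()
dfg.add_edge("Stage1", "Stage2", delay=1)
print(temporal_concurrency(dfg))  # -> 2 (pipelining parallelism)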

4.1.3 Data-parallelism.

Purpose. Data-parallelism can be employed for any part (subsystem) of an application that does not depend on its own history of executions, i.e. these are the stateless parts of an application. Additionally, the type of data processed by data-parallel subsystems is required to be a complex type like a list, array or matrix. The basic idea is to execute the same operation simultaneously on disjoint parts of the data. This concept is analogous to the "single instruction multiple data" concept (SIMD, see [Fly72]) in processor architecture.

For a stateless subsystem S ⊆ G, n parallel instances S_1, . . . , S_n can be added to a schedule for a single iteration. Note that n is virtually only limited by the number of available processors. The complex input data token T^in is split into n parts T^in_1, . . . , T^in_n, each dispatched to one of the instances S_i, i ∈ {1, . . . , n}, that run in parallel. After all S_i have finished processing, the resulting n output tokens T^out_1, . . . , T^out_n are merged into one output token T^out. On the one hand, data-parallelism of degree n can achieve significant speed-up near n for large input data structures, e.g. as found in audio- and image-processing. On the other hand, the additional time required for the split and merge operations may outweigh the reduction in time achieved by data-parallel execution. An introduction to the nature and use of data-parallelism can be found in [GTA06], for example.

Analysis. Strongly connected components analysis from section 3.3 is employed. The analysis function SCC(G) returns the strongly connected components (SCC's) of a data-flow graph G.


Calculation. For a data-flow program with corresponding data-flow graph G, the subgraph Stateful(G) ⊆ G that is stateful and, thus, cannot be used for data-parallelism is defined by

Stateful(G) := ⋃ SCC(G) ,

which is the union of all strongly connected components of G. The subgraph Stateless(G) ⊆ G that is stateless and, thus, can be used for data-parallelization is defined by

Stateless(G) := G / Stateful(G) .

The number of potentially data-parallel tasks Pδ ∈ N of a data-flow program with corresponding data-flow graph G = {V,E} is defined by

Pδ(G) := ‖V ∩ Stateless(G)‖ ,

which is the number of vertices in the stateless subgraph of G.
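
A minimal sketch of this calculation. It assumes that SCC(G) denotes only the components that actually contain a cycle (including delayed self-loops), since trivially every vertex forms its own singleton component; the graph and task names are illustrative:

import networkx as nx

def stateless_vertices(dfg: nx.DiGraph) -> set:
    # Assumption: SCC(G) refers to the components that contain a cycle, i.e. the
    # stateful feedback structures; everything outside them is stateless.
    stateful = set()
    for scc in nx.strongly_connected_components(dfg):
        v = next(iter(scc))
        if len(scc) > 1 or dfg.has_edge(v, v):
            stateful |= scc
    return set(dfg.nodes) - stateful

def data_parallel_tasks(dfg: nx.DiGraph) -> int:
    # P_delta(G) := number of vertices in the stateless subgraph of G.
    return len(stateless_vertices(dfg))

# Illustrative: a stateless pre-processing task in front of a feedback loop.
dfg = nx.DiGraph()
dfg.add_edge("Split", "Filter")
dfg.add_edge("Filter", "Merge")
dfg.add_edge("Merge", "Filter", delay=1)   # feedback cycle: Filter and Merge are stateful
print(data_parallel_tasks(dfg))            # -> 1 (only "Split" is stateless)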

4.2 Hardware-concerned metrics

This subsection introduces three mainly hardware-concerned metrics: speed-up, efficiency and utilization.

4.2.1 Speed-up.

Purpose. An application can be executed sequentially by a single processor or in parallel by p processors. The speed-up S ∈ R reflects how many times faster a given application is executed by the parallel processors than by the single processor. In the context of this document, we are concerned with iterative and non-terminating programs. Hence, speed-up is measured with respect to the iteration period, which is the time to execute all tasks (and communication operations) once. Note that the achievable speed-up is limited by what is also referred to as "Amdahl's law" [Amd67]: the amount of sequential tasks (compare with "sequential composition", section 2.4) severely limits the speed-up achievable by adding more parallel processors.

Analysis. The iteration period of the sequential reference system is Tseq and the iteration period of the parallel system with p processors is Tpar. After constructing a schedule, the iteration period (Tseq or Tpar) is set to the duration of the longest schedule appearing for any of the processors or buses. In the context of uniform computation time or arbitrary integer computation time (see section 1), schedules can be efficiently constructed by Hu-level methods [Hu61], for example. For arbitrary real computation time analysis, e.g. A* methods (see [HNR68] and [PLM99]) or solver-based methods (see [Gre68] and [Vos10]) can be used.

Calculation. The speed-up of a parallel system with iteration period Tpar compared to a sequential system with iteration period Tseq is defined by

S := Tseq / Tpar .


4.2.2 Efficiency.

Purpose. Efficiency is closely related to speed-up (see section 4.2.1, above). The efficiency E ∈ [0, 1] of a parallel system reflects the average utilization of the parallel processing capabilities. In a system with high efficiency (near one), all parallel processing capabilities are utilized well. A system with low efficiency (near zero) is likely to have higher energy consumption and hardware costs than necessary. In this case, e.g. lowering frequencies or removing the least utilized processors (see section 4.2.3, below) can lead to increased efficiency with lowered energy consumption and hardware costs. Possibly, there are parts of the system that are deliberately redundant, e.g. for safety reasons. Usually, these parts reduce the overall efficiency, but cannot be removed, of course.

Analysis. Speed-up S (see section 4.2.1) and the number of parallel processors p are required. Efficiency can be calculated for uniform computation time, arbitrary integer computation time or real computation time schedules (compare to speed-up in section 4.2.1).

Calculation. The efficiency of a parallel system with speed-up S and number of parallel processors p is defined by

E := S / p .

4.2.3 Utilization of resources.

Purpose. Utilization of resources (processors and buses) reflects the amount of time a resource is actually used during the execution of an iteration. The utilization U(r) ∈ [0, 1] is calculated for a given resource r ∈ R, with R being the hardware resources of the system. Resources with high utilization (near one) have little reserve in the case of unforeseen events that affect execution times or transmission times. Resources with low utilization (near zero) are likely to add unnecessary energy consumption or hardware cost to the system.

Analysis. The iteration period of the parallel system with resources R is Tpar. After constructing a parallel schedule, each resource r ∈ R is assigned a sequence of scheduled task activations. In the case of buses, these tasks are communication operations. The sum of execution times of these tasks is Tσ(r) ∈ R, with 0 ≤ Tσ(r) ≤ Tpar. Utilization can be calculated for uniform computation time, arbitrary integer computation time or real computation time schedules (compare to speed-up in section 4.2.1 and efficiency in section 4.2.2).

Calculation. The utilization of a scheduled resource (processor or bus) r in a parallel system with iteration period Tpar is defined by

U(r) := Tσ(r) / Tpar ,

with Tσ(r) being the sum of execution times of the tasks scheduled on resource r.
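
The three hardware-concerned metrics can be computed directly from a constructed schedule. A minimal sketch, assuming the schedule is given as a mapping from each resource to the list of busy times of its scheduled activations (all names and numbers below are illustrative):

def speedup(t_seq: float, t_par: float) -> float:
    # S := T_seq / T_par
    return t_seq / t_par

def efficiency(s: float, p: int) -> float:
    # E := S / p for p parallel processors
    return s / p

def utilization(schedule: dict, t_par: float) -> dict:
    # U(r) := T_sigma(r) / T_par, with T_sigma(r) the summed busy time of resource r
    return {r: sum(busy) / t_par for r, busy in schedule.items()}

# Illustrative numbers for a hypothetical two-processor deployment with one bus.
t_seq, t_par = 10.0, 6.0
schedule = {"P1": [1.0, 4.0], "P2": [2.0, 2.0], "Bus": [0.5]}
s = speedup(t_seq, t_par)           # ~1.67
e = efficiency(s, p=2)              # ~0.83
u = utilization(schedule, t_par)    # {'P1': ~0.83, 'P2': ~0.67, 'Bus': ~0.08}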


4.3 Deployment-concerned metrics

This subsection briefly discusses three mainly deployment-concerned metrics: frequency, reactiveness and jitter-robustness.

4.3.1 Frequency.

Purpose. The deployed system runs with a certain frequency (or iteration rate), which is the reciprocal of the iteration period. Typical software-side techniques to increase the frequency of parallel systems are pipelining (see section 4.1.2) and unfolding (see section 5.1). On the hardware side, fast bus systems with little communication delay (for distributed IPC) can be selected to enable higher frequencies. On the one hand, higher frequencies yield higher reactiveness in embedded software systems. On the other hand, higher frequencies can reduce robustness against timing jitter in distributed systems.

Analysis. After constructing a parallel schedule, the iteration period Tpar is known. Additionally, the iteration bound T∞ (see section 3.4) shows that Tpar ≥ T∞ holds. If Tpar = T∞, the deployed system is rate-optimal. Hence, neither by unfolding and retiming transformations, nor by adding more processors, can a system with a lower iteration period than T∞ be built. Frequency can be calculated in all three granularities of time: uniform, arbitrary integer and real computation time.

Calculation. The iteration rate of a system with iteration period Tpar is defined by

1 / Tpar .

4.3.2 Reactiveness.

Purpose. Reactiveness is often an important quality metric for embedded software systems. It reflects the end-to-end response times from sensors to actuators of the deployed system. Here, reactiveness is defined on the basis of inputs and outputs of the system. Reactiveness is the amount of time that it takes at least before a stimulus on an input can produce observable effects on an output. In the context of this document, we are concerned with the lower bounds of reactiveness.

It is beyond the scope of this document to analyze whether a stimulus can actually ever affect an output. Further, it is not considered how long it takes at most before a stimulus on an input affects an output. For both analyses ("actually ever" and "at most") it is not sufficient to solely consider a DFG's structure. Rather, an analysis of a DFG's behavior, e.g. by model checking, is necessary to retrieve this information regarding the upper bounds of reactiveness. This is ongoing research and needs to be discussed in companion documents.


Analysis. We use the delay profiles from section 3.5 to get for each input x the set of affected outputs Y and the guaranteed delays δgar(x, y) for all y ∈ Y. After constructing a parallel schedule, the iteration period Tpar is known. Response times can be calculated in all three granularities of time: uniform, arbitrary integer and real computation time.

Calculation. The guaranteed response time ρgar ∈ R of a stimulus on an input x on an output y in a system with iteration period Tpar is defined as

ρgar(x, y) := Tpar · δgar(x, y) .

For a data-flow program with DFG G the guaranteed response time is given by

ρgar(G) := Tpar · min{δgar(x, y)} ,

where the minimum is taken over all guaranteed delays from any input x to any output y in G. The deployed system is guaranteed to respond no faster than ρgar(G) to any input stimulus.
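
A minimal sketch of this calculation, assuming the delay profile of section 3.5 is available as a mapping from (input, output) pairs to guaranteed delays measured in iterations (names and values are illustrative):

def guaranteed_response_time(t_par: float, delay_profile: dict) -> float:
    # rho_gar(G) := T_par * min over all guaranteed delays delta_gar(x, y)
    return t_par * min(delay_profile.values())

# Illustrative delay profile: guaranteed delays measured in iterations.
delay_profile = {("sensor1", "actuator1"): 2, ("sensor2", "actuator1"): 3}
print(guaranteed_response_time(t_par=5.0, delay_profile=delay_profile))  # -> 10.0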

4.3.3 Jitter-robustness.

Purpose. As mentioned in section 3.5, jitter in time is a problem that occurs in distributed real-time systems. An example of jitter is the deviation in transmission times of messages over buses. We define jitter-robustness as the maximum amount of jitter that cannot break the system's expected input-/output-behavior. For reasons of simplicity, jitter is expected to be of positive value only, i.e. communication happens at t + ε, where ε ≥ 0 is the jitter and t is the exact time without jitter. It is ongoing research how transformation techniques like retiming can be applied to maximize the jitter-robustness of deployed systems.

Analysis. Jitter-robustness can be expressed in iterations or in time. To calculate the time, the iteration period Tpar has to be known. In a DFG G = {V,E}, each edge e = (v1, v2) with v1, v2 ∈ V and e ∈ E has a jitter-robustness proportional to the edge's unit-delay δ(e). Messages sent by v1 over e are required by v2 after δ(e) iterations at the latest. If e is transmitted over a bus, then a jitter of up to the duration of δ(e) iterations on that bus is tolerated. This idea can be extended from single edges to multiple edges between two vertices v1 and v2, as explained in the next paragraph.

Calculation. Let H ⊆ E be the set of all edges from v1 to v2 and from v2 to v1. If H is empty, then the jitter-robustness for v1 and v2 is undefined; otherwise continue as follows. Jitter-robustness in iterations is the minimum delay between the vertices v1 and v2, defined as

JitterRobustnessIterations(v1, v2) := min{δ(h)}

where the minimum is taken over all unit-delays of edges h ∈ H. Jitter-robustness in time is defined as

JitterRobustnessTime(v1, v2) := Tpar · JitterRobustnessIterations(v1, v2) .
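
A minimal sketch, assuming unit-delays are stored in an edge attribute named "delay" and that there is at most one edge per direction between the two vertices (a MultiDiGraph would be needed to model several parallel channels):

import networkx as nx

def jitter_robustness_iterations(dfg: nx.DiGraph, v1, v2):
    # Minimum unit-delay over all edges between v1 and v2 (in either direction);
    # undefined (None) if the two vertices are not directly connected.
    delays = [d for u, v, d in dfg.edges(data="delay", default=0)
              if (u, v) in ((v1, v2), (v2, v1))]
    return min(delays) if delays else None

def jitter_robustness_time(dfg: nx.DiGraph, v1, v2, t_par: float):
    # Jitter-robustness in time := T_par * jitter-robustness in iterations.
    k = jitter_robustness_iterations(dfg, v1, v2)
    return None if k is None else t_par * k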


5 Concurrency Transformation

This section gives an overview of behavior-invariant transformation techniques that can be applied to iterative data-flow models. The models are transformed in order to influence their concurrency properties and the parallelism they offer. The metrics discussed in section 4 may be used to monitor this influence. In the context of this paper, a behavior-invariant transformation is considered to leave the original input-output behavior unaltered, though additional latency may be introduced. Here, three transformation techniques are presented: unfolding, retiming and look-ahead. Each of these techniques is used in typical areas of application. A more detailed discussion of this topic can be found in [Par89a], for example.

5.1 Unfolding-Transformation

Features of the unfolding transformation. The unfolding transformation aims at increasing parallelism by constructing combined schedules for multiple successive iterations of a data-flow program. The main parameter of this transformation is the unfolding factor J. The resulting schedules combine J iterations; thus, inter-iteration parallelism can be exploited. By increasing parallelism, unfolding allows for reducing the iteration period if sufficient parallel processors are available. In pure feed-forward data-flow programs, there is virtually no limit for reducing the iteration period by unfolding. Nevertheless, in data-flow programs with feedback, the iteration bound (see section 3.4) poses a lower limit on the achievable iteration period. A profound introduction to the unfolding transformation is found in [PM91].

Figure 11: Unfolding transformation of a feed-forward system. (a) Original DFG with 3 tasks. (b) Unfolded DFG with factor J = 3.

Example 5.1.1 (Unfolding transformation of a feed-forward system). Figure 11(a) shows a simple feed-forward system with three tasks: SensorProcessing, ControlAlgorithm and ActuatorProcessing. Similar kinds of structures are likely to be found in embedded systems that execute control functions. The execution times of the tasks are given as τ(SensorProcessing) = 1, τ(ControlAlgorithm) = 4 and τ(ActuatorProcessing) = 1. Hence, a one-processor schedule has an iteration period of 6. By using more processors and pipelining parallelism, the iteration period can be reduced to 4 (the execution time of ControlAlgorithm).


Figure 12: Precedence graph and schedule of an unfolded feed-forward system. (a) Precedence graph of the unfolded DFG in figure 11(b). (b) A three-processor schedule of the unfolded DFG in figure 11(b).

By the unfolding transformation, the iteration period of this system can be reduced below 4. Figure 11(b) shows the unfolded DFG with unfolding factor J = 3. The unfolded DFG comprises 9 tasks, where SensorProcessing_i, ControlAlgorithm_i and ActuatorProcessing_i represent the execution of the respective task in the i-th iteration (i ∈ {1, 2, 3}). Note that unfolding preserved the sum of delays (2D).

Figure 12(a) shows the acyclic precedence graph of the unfolded DFG in figure 11(b). The critical path (SensorProcessing_1 ≺ ControlAlgorithm_2 ≺ ActuatorProcessing_3) of the three-unfolded system requires 6 units of time. Figure 12(b) illustrates a three-processor schedule of the unfolded DFG in figure 11(b). Task names are abbreviated as follows for better readability: SensorProcessing (S), ControlAlgorithm (C) and ActuatorProcessing (A). This schedule satisfies the precedence constraints from figure 12(a) and has a total duration of 6. Since the schedule executes 3 iterations of the original DFG, the iteration period is 2.

Figure 13: Unfolding transformation of a feedback system. (a) Original DFG with 3 tasks and a feedback cycle. (b) Unfolded DFG with factor J = 2.


Example 5.1.2 (Unfolding transformation of a feedback system). Now we continue and extend example 5.1.1 by introducing a feedback channel. Figure 13(a) shows a feedback system with three tasks: SensorProcessing, ControlAlgorithm and ActuatorProcessing. Additionally, there is a feedback channel with a unit-delay from ActuatorProcessing to ControlAlgorithm. The execution times of the tasks are equal to those in example 5.1.1: τ(SensorProcessing) = 1, τ(ControlAlgorithm) = 4 and τ(ActuatorProcessing) = 1. Consequently, a one-processor schedule still has an iteration period of 6. Again, task names are abbreviated in the following for better readability: SensorProcessing (S), ControlAlgorithm (C) and ActuatorProcessing (A). The iteration bound (see section 3.4) of this system with one feedback cycle is

(τ(C) + τ(A)) / (δ((C, A)) + δ((A, C))) = (4 + 1) / (1 + 1) = 2.5 .

Thus, by unfolding and using more processors, the iteration period can be reduced to 2.5 (the iteration bound).

Figure 14: Precedence graph and schedule of an unfolded feedback system. (a) Precedence graph of the unfolded DFG in figure 13(b). (b) A rate-optimal three-processor schedule of the unfolded DFG in figure 13(b).

Figure 13(b) shows the unfolded DFG with unfolding factor J = 2. The unfolded DFG comprises 6 tasks, where SensorProcessing_i, ControlAlgorithm_i and ActuatorProcessing_i represent the execution of the respective task in the i-th iteration (i ∈ {1, 2}). Note that unfolding preserved the sum of delays (3D).

Figure 14(a) shows the acyclic precedence graph of the unfolded DFG in figure 13(b). Figure 14(b) illustrates a three-processor schedule of the unfolded DFG in figure 13(b). The schedule satisfies all precedence constraints from figure 14(a) and has a total duration of 5. Since the schedule executes 2 iterations of the original DFG, the iteration period is 2.5. The iteration period of 2.5 equals the iteration bound; thus, this schedule is rate-optimal. Unfolding with larger factors J > 2 is possible, but does not yield a lower iteration period than 2.5. Note that the given three-processor schedule has a speed-up of S = 6/2.5 = 2.4 and an efficiency of E = 2.4/3 = 0.8 (see section 4.2). It is possible to construct a two-processor schedule with an iteration period of 3 that has full efficiency of E = 1.0, but a lower speed-up of S = 2.0.


Figure 15: Periodic overlapping three-processor schedule for the unfolded DFG of example 5.1.1. By overlapped execution, the iteration period is 2.

Overlapping schedules and periodic task execution. Note that, after unfolding, schedules can have deviations in the periodic execution of tasks. For example, the schedule given in example 5.1.1, figure 12(b), activates task S at t = 0, 1, 5, 6, 7, 11, . . . This can be problematic for systems that need strictly periodic activation of tasks, e.g. if S needs to read data from an analog-to-digital converter (ADC) at t = 0, 2, 4, 6, 8, 10, . . . To overcome this problem, overlapping schedules can be used in combination with unfolding. Overlapped execution refers to the fact that one processor can start executing the next iteration while other processors are still executing the current iteration. Figure 15 shows a strictly periodic overlapping three-processor schedule for the unfolded DFG of example 5.1.1. This schedule satisfies the precedence constraints. By overlapped execution, the iteration period remains 2. Moreover, each of the original tasks (S, C and A) is activated with a period of 2.

History, application and scope. A detailed description of the algorithm behind the unfolding transformation is given in [PM91]. An important property of the unfolding transformation is that it preserves the number of delays of the original data-flow program. Hence, no additional latency is introduced, which is particularly important for reactive systems. Both examples (5.1.1 and 5.1.2) illustrate that the sum of delays in the unfolded system equals the sum of delays in the original system. Depending on the effects of additional inter-processor communication (IPC), an unfolded program can execute at a significantly higher iteration rate than the original program does. In general, data-flow programs with large-grain tasks (e.g. complex sub-programs inside a task) and fine-grain tasks (e.g. simple linear functions) can profit from unfolding.
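
The following sketch illustrates the standard unfolding rule from the DSP literature: an edge u → v carrying w unit-delays becomes, for k = 0, . . . , J−1, an edge from the k-th copy of u to the ((k + w) mod J)-th copy of v with ⌊(k + w)/J⌋ unit-delays. It is a simplified illustration, not the exact Cadmos implementation, and the edge attribute name "delay" as well as the example graph are assumptions:

import networkx as nx

def unfold(dfg: nx.DiGraph, J: int) -> nx.DiGraph:
    # Each task u becomes J copies (u, 0) .. (u, J-1); an edge u -> v with w
    # unit-delays becomes, for k = 0..J-1, an edge (u, k) -> (v, (k + w) % J)
    # carrying (k + w) // J unit-delays. The total number of delays is preserved.
    unfolded = nx.DiGraph()
    for u in dfg.nodes:
        unfolded.add_nodes_from((u, k) for k in range(J))
    for u, v, w in dfg.edges(data="delay", default=0):
        for k in range(J):
            unfolded.add_edge((u, k), (v, (k + w) % J), delay=(k + w) // J)
    return unfolded

# Feed-forward chain with one unit-delay per edge (similar in spirit to example 5.1.1).
dfg = nx.DiGraph()
dfg.add_edge("S", "C", delay=1)
dfg.add_edge("C", "A", delay=1)
g3 = unfold(dfg, J=3)
print(sum(d for _, _, d in g3.edges(data="delay")))  # -> 2, the sum of delays is preserved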


5.2 Retiming-Transformation

Features of the retiming transformation. The retiming transformation aims at increasing parallelism by changing the precedence constraints in data-flow programs. The term retiming refers to the fact that delays are moved around in a DFG; thus, the "timing" of the DFG is altered. Delays are moved around in such a way that the total number of delays in a cycle (see section 3.3) of the program remains unchanged. Changing the number of delays affects precedence (see section 3.1). The iteration period is reduced if the altered precedence constraints allow for parallel schedules with a shorter duration. A typical local retiming transformation is the removal of n unit-delays from each of the incoming edges of a vertex v and the addition of n unit-delays to each of the outgoing edges of v. This local retiming transformation can be applied to a vertex if all of its incoming edges have at least one unit-delay associated with them. Any global retiming transformation can be described by a combination of local retiming transformations. A comprehensive description of the retiming transformation is given in [LS91].
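
A minimal sketch of the local retiming move described above, assuming every edge carries a "delay" attribute; it is illustrative only and not the algorithm of [LS91]:

import networkx as nx

def retime_vertex(dfg: nx.DiGraph, v, n: int = 1) -> nx.DiGraph:
    # Move n unit-delays from every incoming edge of v to every outgoing edge of v.
    # Only legal if every incoming edge carries at least n unit-delays; the number
    # of delays in every cycle through v is left unchanged.
    g = dfg.copy()
    if any(g.edges[u, v]["delay"] < n for u in g.predecessors(v)):
        raise ValueError("retiming not applicable: an incoming edge has too few delays")
    for u in g.predecessors(v):
        g.edges[u, v]["delay"] -= n
    for w in g.successors(v):
        g.edges[v, w]["delay"] += n
    return g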

Figure 16: A data-flow program before retiming. (a) Original DFG with 4 tasks, 2 unit-delays and one cycle. (b) Precedence graph. (c) Two-processor schedule with iteration period 6.

Example 5.2.1 (Retiming transformation). Figure 16(a) shows the DFG of a data-flow program with 4 tasks, 2 unit-delays in total, and one cycle. The precedence graph of this program is given in figure 16(b). Execution times of the tasks are given as τ(A) = 1, τ(B) = 2, τ(C) = 4 and τ(D) = 4. With the given precedence relations and execution times, not more than two parallel processors can be leveraged. A two-processor schedule with an iteration period of 6 is illustrated in figure 16(c).

Now, we apply a local retiming transformation on task B by removing one unit-delay at the incoming edge and adding one unit-delay to each of the two outgoing edges. Figure 17(a) shows the retimed DFG, which now has 3 unit-delays in total. Though the total number of delays in the DFG changed, the number of delays in the cycle remained unchanged (2D). The altered precedence graph of this retimed system is given in figure 17(b). With these altered precedence relations, three parallel processors can be leveraged. A three-processor schedule with a reduced iteration period of 4 is illustrated in figure 17(c).

History, application and scope. Retiming was first proposed by Leiserson, Rose and Saxe [LRS83] to increase the frequency of synchronous circuitry. Likewise, retiming can be used to increase the frequency of data-flow programs that represent software.


Figure 17: A retimed data-flow program that can leverage 3 parallel processors. (a) Retimed DFG with 3 unit-delays. (b) Altered precedence graph of the retimed system. (c) Three-processor schedule with reduced iteration period 4.

Note that retiming leaves the total number of delays in cycles unchanged; hence, the iteration bound (see section 3.4) remains unchanged, too. However, retiming can change the total number of delays in a DFG. For practical schedules, the additional parallelism produced by retiming also relies on pipelining. Pipelining of tasks on different processors requires inter-processor communication (IPC). Depending on the effects of this additional IPC, a retimed program can execute at an increased iteration rate. Data-flow programs with large-grain tasks and fine-grain tasks may profit from retiming.

5.3 Look-Ahead-Transformation

Features of the look-ahead transformation. The look-ahead transformation aims at increasing parallelism in recursive systems, i.e. systems that have feedback loops. The main parameter of this transformation is the look-ahead depth L. After a look-ahead transformation together with recursive doubling by depth L, the transformed model performs L iterations in time O(log2 L) if at least L parallel processors are available. In this section, the basic approach of the look-ahead transformation on first order recurrence systems is outlined.

Example 5.3.1 (Look-ahead transformation of a first order recursive system). The following example illustrates the idea of the look-ahead transformation. A first order recursive system is transformed with L = 2 and a two-processor schedule for the transformed system is given. Note that this is a slightly modified version of a more comprehensive example found in [Par89b]. Consider the following equation of a basic first order recursive system

y(t + 1) = ay(t) + bx(t) + c . (1)

The input series to this system is x(t), the output series is y(t), and a, b and c are constants. While a and b are constant factors, c is a constant summand. For example, this basic system can be configured to realize a stable discrete low-pass filter section by assigning a + b = 1 with 0 ≤ a ≤ 1 and 0 ≤ b ≤ 1 and using c for input offset compensation. This system is first order recursive, since the term y(t + 1) is calculated depending on its preceding value y(t). It is this recursive dependency in the series y(t) that hinders efficient parallelization of system (1). Often, a recursively defined series like y(t) is also called a state of the system.


Figure 18: A data-flow graph of equation 1 with input series x(t), output series y(t), constants a, b and c, and one unit-delay D in the recursive loop. The upper right box explains the multiply-and-add operator used as a shortcut in this graph.

The following paragraphs show how look-ahead shifts a state's immediate inter-iteration dependency further into the future, thus creating pipeline-interleaved parallelism.

Figure 18 shows a data-flow graph of the system described by equation 1. In the following, a look-ahead transformation with L = 2 is applied in two steps: recasting and static look-ahead computation. First, equation 1 is recast by expressing y(t + 2) as a function of y(t) to derive

y(t + 2) = a [ay(t) + bx(t) + c] + bx(t + 1) + c . (2)

Second, static look-ahead computation is applied to equation 2, finally obtaining L − 1 = 1 steps of look-ahead in equation 3

y(t + 2) = a²y(t) + abx(t) + bx(t + 1) + ac + c . (3)

Figure 19 shows a data-flow graph of the transformed system expressed by equation 3. Note that the terms ac + c, b, ab and a² are constant and can be precomputed at compile time. The transformed model exposes one step of look-ahead at its inputs, manifested in the new input series x(t + 1) instead of x(t) from the original model.

Figure 19: An equivalent first order recursive system transformed with L = 2, leading to L − 1 = 1 steps of look-ahead and two unit-delays (2D) in the recursive loop.

Finally, this model is two-way parallelized using a pipeline-interleaving approach for scheduling. Figure 20 shows a partial two-processor schedule for the look-ahead-transformed system of figure 19 using a pipeline-interleaving approach. This schedule offers two-fold parallelism and, thus, a theoretical speedup of 2. The practical speedup depends on the efficient realization of the pipeline interleaving on concrete CPU cores and on the overhead imposed by the additional multiplications and delays in the look-ahead-transformed system.


Time (t):                 0, 1    2, 3    4, 5
Processor 1, state y(t):  y(-1)   y(1)    y(3)
Processor 2, state y(t):  y(0)    y(2)    y(4)

Figure 20: A partial two-processor schedule for the system of figure 19 using a pipeline-interleaving approach.
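
To make the equivalence of equations (1) and (3) concrete, the following minimal sketch iterates both recurrences with illustrative coefficients (not taken from the document) and checks that they produce the same output series:

# Illustrative coefficients with a + b = 1 (a stable low-pass section).
a, b, c = 0.6, 0.4, 0.1
x = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Original recurrence (1): y(t+1) = a*y(t) + b*x(t) + c
y_ref = [0.0]
for t in range(len(x) - 1):
    y_ref.append(a * y_ref[t] + b * x[t] + c)

# Look-ahead form (3): y(t+2) = a^2*y(t) + a*b*x(t) + b*x(t+1) + a*c + c,
# seeded with y(0) and y(1); even and odd indices form two independent streams
# that can be computed on two processors (pipeline interleaving).
y_la = [y_ref[0], y_ref[1]]
for t in range(len(x) - 2):
    y_la.append(a * a * y_la[t] + a * b * x[t] + b * x[t + 1] + a * c + c)

assert all(abs(p - q) < 1e-9 for p, q in zip(y_ref, y_la))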

Iteration period and signal processing. The look-ahead transformation is capable of reducing the iteration period below the iteration bound of the original data-flow model. This reduction is achieved by actually modifying the algorithm described by the original data-flow model, leaving the input-output behavior unchanged. Transforming a data-flow model by a look-ahead of L introduces L-fold parallelism in the transformed model. Look-ahead is an interesting technique for creating additional parallelism in a class of data-intensive signal processing applications that process large amounts of input data. Digital signal processing, e.g. filtering, usually applies linear functions like additions and multiplications, which are well-suited for look-ahead transformation. The throughput in such data-intensive applications increases significantly by using recursive doubling along with the look-ahead transformation.

Additional latency and reactive systems. Looking ahead L iterations implies that input values for the next L iterations must be available before the transformed data-flow model can produce its first output values. Thus, a look-ahead transformation by L introduces an additional output latency of L iterations. This may lead to exceeding the maximum admissible response times in reactive systems. Depending on the concrete scenario, highly reactive embedded applications (that require short response times upon real-time input) may not profit from look-ahead.

History, application and scope. Look-ahead techniques were first presented by Kogge and Stone [KS73] as a solution to a general class of recurrence equations. This work led to what is also known as recursive doubling algorithms. Later, Parhi [Par89a], [Par89b] showed how these techniques increase the parallelism in recursive digital filters. Parhi also describes how look-ahead can even be applied to parallelize state automata and other non-linear time-varying systems. Originally, this work aimed at synchronous integrated circuit design, but it is useful for software applications described by iterative data-flow, too.


6 Tool-Integration

Cadmos - A concurrent architectures research toolkit. Many of the techniques presented in this document are implemented in Cadmos, a toolkit for the Eclipse Rich Client Platform (RCP) [Ecl]. The intended purpose of this toolkit is research in the area of concurrent architectures in embedded software systems. Cadmos is developed by the chair of software and systems engineering of the Institut für Informatik at the Technische Universität München (TUM). Cadmos does not offer a programming language by itself, but rather offers an interface to integrate with existing programming and modeling languages. At the moment of writing, integration with MATLAB/Simulink [Mat] is work in progress, so as to allow for concurrency analysis and transformation of industrial Simulink and Stateflow models. Additionally, Cadmos has extensive support for the case-tool AutoFocus [Aut].

Figure 21: Modeling with the case-tool AutoFocus supported by interactive concurrency analysis of Cadmos.

The Cadmos toolkit offers several analysis "views". Some of these views visually present data-flow graphs and precedence graphs with automatic layout. The set of implemented analysis techniques from section 3 is being completed incrementally. Other views are concerned with the presentation of the metrics discussed in section 4. The unfolding transformation is already implemented; interactive retiming and look-ahead transformations are left for future work. For example, in section 5.1 the figures of DFG's and precedence graphs (figures 11(a), 11(b), 12(a), 13(a), 13(b), 14(a)) are analyzed, transformed and rendered by Cadmos.

Figure 21 illustrates the integration of Cadmos with the case-tool AutoFocus. In this example, data-flow graph visualization with WCC- and cycle-highlighting (see sections 3.2 and 3.3) is shown in the upper-right corner, and delay profiles (see sections 3.5 and 4.3.2) are shown in the lower-right corner.


Figure 22: Visualization and transformation in Cadmos. (a) Data-flow graph visualization. (b) Unfolding transformation and visualization with Kamada-Kawai layout. (c) Intra- and inter-iteration precedence graph construction and visualization.

AutoFocus - A research prototype for seamless model-based development. AutoFocus [Aut] is developed and maintained by the chair of software and systems engineering at TUM. The purpose of AutoFocus is to serve as a research prototype for integrated modeling, simulation, verification and deployment of reactive embedded software systems. In AutoFocus, the architecture and behavior of software for embedded systems is specified by causal component networks (see section 2.3). The composition of these networks is based on timed versions of the operators presented in section 2.4. AutoFocus's model of computation closely resembles iterative data-flow (see section 2.1). Thus, an AutoFocus model can easily be translated into an iterative data-flow model (see section 2.3) and subsequently be analyzed by the Cadmos toolkit. As mentioned above, a similar integration for the design and analysis of MATLAB/Simulink models is work in progress.


7 Conclusion and Future Work

First experiments with the tool integration (see section 6) show the practicability of the presented analysis and transformation techniques (see sections 3 and 5) in supporting the software engineering process for parallel embedded software systems. The metrics discussed in section 4 give support in design and deployment decisions. Nevertheless, the evaluation of the practicability of these metrics and the introduction of further metrics is future work.

Elaborate hardware description required. The presented techniques mainly focus on the iterative data-flow models for the software side. In the future, these need to be more closely coupled with elaborate models of the hardware side to be practical. A sufficient description of the hardware side should comprise embedded controller networks, cores, field-buses and interlink-buses. A promising approach in this direction is the "technical perspective" explained in [TRS+10].

Time- and space-efficient static schedulers required. Time- and space-efficient static scheduling techniques are required for the construction of arbitrary real-time schedules with communication latencies. For example, imagine the complexity of constructing a static real-time schedule for a "complete car" with circa 1000 tasks to be scheduled on 100 cores, distributed over 40 embedded controllers, connected by 5 field-buses. It is left for future work to further investigate static scheduling with Hu-level methods [Hu61], A* methods ([HNR68], [PLM99]), and solver-based methods ([Gre68], [Vos10]).

Integration with industry-relevant tools. The integration with more tools that are relevant in the embedded software systems industry is future work. A major step in this direction is the integration with MATLAB/Simulink, which is work in progress.


References

[ABC+06] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[Amd67] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS '67 (Spring): Proc. of the April 18-20, 1967, spring joint computer conference, pages 483–485, New York, NY, USA, 1967. ACM.

[Aut] AutoFocus3 Homepage. http://af3.in.tum.de/.

[BDD+92] Manfred Broy, Frank Dederich, Claus Dendorfer, Max Fuchs, Thomas Gritzner, and Rainer Weber. The design of distributed systems - an introduction to FOCUS. Technical Report TUM-I9202, Technische Universität München, January 1992.

[Bod95] A. Bode. Klassifikation paralleler Architekturen. Parallelrechner: Architekturen-Systeme-Werkzeuge, Leitfaden der Informatik, Teubner, Stuttgart, Germany, pages 11–40, 1995.

[Bro95] Manfred Broy. Advanced component interface specification. In Takayasu Ito and Akinori Yonezawa, editors, Theory and Practice of Parallel Programming - International Workshop TPPP'94, pages 89–104. Springer, 1995.

[Bro06a] M. Broy. Challenges in automotive software engineering. In ICSE '06: Proceedings of the 28th international conference on Software engineering, pages 33–42, New York, NY, USA, 2006. ACM.

[Bro06b] M. Broy. The 'grand challenge' in informatics: Engineering software-intensive systems. Computer, 39(10):72–80, 2006.

[Bro10] M. Broy. Relating time and causality in interactive distributed systems. European Review, 18:507–563, 2010.

[BS01] M. Broy and K. Stoelen. Specification and development of interactive systems: Focus on streams, interfaces, and refinement, 2001.

[Den80] J.B. Dennis. Data flow supercomputers. IEEE computer, 13(11):48–56, 1980.

[DK82] A. L. Davis and R. M. Keller. Data flow program graphs. Computer, 15(2):26–41, 1982.

[Ecl] Eclipse Homepage - The Eclipse Foundation open source community website. http://www.eclipse.org/.

[Fly72] M.J. Flynn. Some computer organizations and their effectiveness. Computers, IEEE Transactions on, 21:948–960, 1972.


[FW78] Steven Fortune and James Wyllie. Parallelism in random access machines. In STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing, pages 114–118, New York, NY, USA, 1978. ACM.

[Gre68] H.H. Greenberg. A branch-bound solution to the general scheduling problem. Operations Research, 16(2):353–361, 1968.

[GTA06] Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. SIGOPS Oper. Syst. Rev., 40(5):151–162, 2006.

[HNR68] P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. Systems Science and Cybernetics, IEEE Transactions on, 4(2):100–107, July 1968.

[Hu61] T. C. Hu. Parallel sequencing and assembly line problems. Operations Research, 9(6):841–848, Nov.-Dec. 1961.

[KM66] Richard M. Karp and Raymond E. Miller. Properties of a model for parallel computations: Determinacy, termination, queueing. SIAM Journal on Applied Mathematics, 14(6):1390–1411, 1966.

[KS73] P.M. Kogge and H.S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22:786–793, 1973.

[Lee06] Edward A. Lee. The problem with threads. Computer, 39:33–42, 2006.

[LM87a] E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers, C-36(1):24–35, 1987.

[LM87b] Edward A. Lee and David G. Messerschmitt. Synchronous data flow. Proc. of the IEEE, 75(9):1235–1245, September 1987.

[LRS83] C.E. Leiserson, F.M. Rose, and J.B. Saxe. Optimizing synchronous circuitry by retiming. In Third Caltech Conference on Very Large Scale Integration, pages 87–116. Computer Science Press, Incorporated, 1983.

[LS91] C.E. Leiserson and J.B. Saxe. Retiming synchronous circuitry. Algorithmica, 6(1):5–35, 1991.

[LS11] Edward A. Lee and Sanjit A. Seshia. Introduction to embedded systems, a cyber-physical systems approach. http://LeeSeshia.org/, 2011.

[Mat] MathWorks Homepage - MATLAB and Simulink for technical computing. http://www.mathworks.com/.

[Par89a] Keshab K. Parhi. Algorithm transformation techniques for concurrent processors. Proceedings of the IEEE, 77:1879–1895, Dec. 1989.

[Par89b] Keshab K. Parhi. Pipeline interleaving and parallelism in recursive digital filters. I. Pipelining using scattered look-ahead and decomposition. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:1099–1117, July 1989.


[PLM99] D. Piriyakumar, Paul Levi, and C. Murthy. Optimal scheduling of iterative data-flow programs onto multiprocessors with non-negligible interprocessor communication. In Peter Sloot, Marian Bubak, Alfons Hoekstra, and Bob Hertzberger, editors, High-Performance Computing and Networking, volume 1593 of Lecture Notes in Computer Science, pages 732–743. Springer Berlin / Heidelberg, 1999. doi:10.1007/BFb0100634.

[PM91] Keshab K. Parhi and David G. Messerschmitt. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Trans. Comput., 40(2):178–195, 1991.

[Rei68] Raymond Reiter. Scheduling parallel computations. J. ACM, 15(4):590–599, 1968.

[Sch09] T. Schuele. A Coordination Language for Programming Embedded Multi-Core Systems. In 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 201–209. IEEE, 2009.

[Tar72] Robert Endre Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146–160, 1972.

[TRS+10] Judith Thyssen, Daniel Ratiu, Wolfgang Schwitzer, Alexander Harhurin, Martin Feilkas, and Eike Thaden. A system for seamless abstraction layers for model-based development of embedded software. In Software Engineering (Workshops), pages 137–148, 2010.

[Vos10] Sebastian Voss. Integrated Task and Message Scheduling in Time-Triggered Aeronautic Systems. PhD thesis, Universität Duisburg-Essen, Fakultät für Wirtschaftswissenschaften, Institut für Informatik und Wirtschaftsinformatik, 2010.

[XP93] J. Xu and D.L. Parnas. On satisfying timing constraints in hard-real-time systems. IEEE Transactions on Software Engineering, 19:70–84, 1993.
