Programming Heterogeneous Embedded Systems …...Programming Heterogeneous Embedded Systems for IoT...

Post on 25-May-2020

33 views 1 download

Transcript of Programming Heterogeneous Embedded Systems …...Programming Heterogeneous Embedded Systems for IoT...

Programming Heterogeneous Embedded Systems for IoT

Jeronimo CastrillonChair for Compiler ConstructionTU Dresdenjeronimo.castrillon@tu-dresden.de

Get-together toward a sustainable collaboration in IoTApril 18th, Tunis, Tunisia

Internet of things

2 © J. Castrillon. Get-together for IoT, Apr. 2016

Source: Open clipart

Real time

Distributed

Resource constrained

Energy efficient

Highly heterogeneous

Multiple domains

Internet of things

3 © J. Castrillon. Get-together for IoT, Apr. 2016

Source: Open clipart

Distributed Energy efficient

Highly heterogeneous

Real time Resource constrained

Multiple domains

IoT/Systems-of-Systems: Extremely hard to program and

reason about

Independent nodes: Trends

q Single node: HW complexityq Increasing heterogeneityq Complex interconnect

q Increasing number of cores

q SW “complexity”q Not anymore a simple control loopq Need expressive programming

models

© J. Castrillon. Get-together for IoT, Apr. 20164

0 2 4 6 8

10 12 14 16

2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014

Num

ber o

f PEs

Year

PE Count in SoCs

OMAP Family

OMAP1 OMAP2OMAP3530

OMAP3640OMAP4430

OMAP4470

OMAP5430

Snapdragon Family

S1 QSD8650 S2 MSM8255S3 MSM8060

S4 MSM8960

S4 APQ8064

[Castrill14]

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Independent nodes: Trends

q Single node: HW complexityq Increasing heterogeneityq Complex interconnect

q Increasing number of cores

q SW “complexity”q Not anymore a simple control loopq Need expressive programming

models

© J. Castrillon. Get-together for IoT, Apr. 20165

0 2 4 6 8

10 12 14 16

2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014

Num

ber o

f PEs

Year

PE Count in SoCs

OMAP Family

OMAP1 OMAP2OMAP3530

OMAP3640OMAP4430

OMAP4470

OMAP5430

Snapdragon Family

S1 QSD8650 S2 MSM8255S3 MSM8060

S4 MSM8960

S4 APQ8064

[Castrill14]

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Outline

q Introduction

q Models of computation

q Tool flows

q Outlook for IoT

q Summary

© J. Castrillon. Get-together for IoT, Apr. 20166

Outline

q Introduction

q Models of computation

q Tool flows

q Outlook for IoT

q Summary

© J. Castrillon. Get-together for IoT, Apr. 20167

What is a MoC?

8 © J. Castrillon. Get-together for IoT, Apr. 2016

A model of computation is the definition of the set of allowable operations used in computation and their respective costs. It is used for measuring the complexity of an algorithm in execution time and or memory space

Wikipedia: From computational theory

Model of computations are abstract specifications of how a computation can progress

http://c2.com/cgi/wiki?ModelsOfComputat ion

A Model of computation is a collection of rules that govern the interaction of components

Ptolemaeus, C. (Ed.). System Design, Modeling, and Simulation using Ptolemy II Ptolemy.org, 2014

(Parallel) Models of Computation (MoCs): Rules

q What are the componentsq Threads, processes, actors, tasks or procedures

q Execution and concurrencyq Order and determinism

q Communication mechanisms: How do components exchange dataq Asynchronous or rendezvous, perfect or lossy

9 © J. Castrillon. Get-together for IoT, Apr. 2016

A Model of computation is a collection of rules that govern the interaction of components

Ptolemaeus, C. (Ed.). System Design, Modeling, and Simulation using Ptolemy II Ptolemy.org, 2014

Why MoCs?

q Depending on the rules, MoCs feature different propertiesq Properties

q Relate to the power: what can be computed with the modelq Makes it easier for automatic tools to understand and synthesize code

§ E.g., compile-time optimization, support resource constraints, …

q Examplesq Can the application run on bounded memory?q Can we compute the maximal throughput?

q Is the execution deterministic?

10 © J. Castrillon. Get-together for IoT, Apr. 2016

MoCs: Overview

11 © J. Castrillon. Get-together for IoT, Apr. 2016

Ptolemaeus, C. (Ed.). System Design, Modeling, and Simulation using Ptolemy II Ptolemy.org, 2014

For IoT in general: Mixture of MoCs interacting (across different domains. In this talk: Software (concurrent, dataflow + PN + time)

Homogeneous Synchronous Dataflow (HSDF)

q Graph with simple execution semantics

q Functional, graph model can be extended to modelq Bounded buffersq Schedules

12 © J. Castrillon. Get-together for IoT, Apr. 2016

a1 a2 a3e1 e3

e4

e2

a1 a2 a3e1 e3

e4

e2

a1 a2 a3e1 e3

e4

e2

a1 a2

c

HSDF: schedules

q Overlapping schedule

13 © J. Castrillon. Get-together for IoT, Apr. 2016

a2

a1

a3

a4 Time

PE2

PE1 a4

a3

a1

a2

a4

a3

a1

a2

a4

a3

a1

a2

How to delay the execution of a3?, how to prevent a4 from executing earlier?

Outline

q Introduction

q Models of computation

q Tool flows

q Outlook for IoT

q Summary

© J. Castrillon. Get-together for IoT, Apr. 201614

© J. Castrillon. Get-together for IoT, Apr. 201615

Programming flow: Overview

MoC-based Application

Non-functional specification

PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_rightPNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_right;PNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_left;PNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_right;PNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_src;taskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_fft_l;taskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_ifft_r;taskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_sink;taskParams.priority = 1;

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2Architecture

model

Analysis

Synthesis

Code generation

[Castrill11a]

Application model: Beyond HSDF

q SDF: Popular model for static applicationsq Possible to derive equivalent HSDF

q Dynamic DF (DDF): For dynamic applicationsq Flexible firing rules (even non-determinism)

q Kahn Process Networks (KPN): Also dynamicq Deterministic q No firing semantics

16 © J. Castrillon. Get-together for IoT, Apr. 2016

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a3i1

i2

p1 p2 p3e1 e3

e4

e2

Non/extra-functional: Constraints

q Timing constraintsq Process throughputq Latencies along paths

q Time triggeringq Mapping constraints

q Processes to processors

q Channels to primitivesq Platform constraints

q Subset of resources (processors or memories)q Utilization

© J. Castrillon. Get-together for IoT, Apr. 201617

1 ms

3 ms

1 ms

Architecture model

q Key for retargetable tool flowsq Abstract for speedq Relevant components: processors (types),

interconnect and memories

q Why? Understand effect of executing an actor on the different resources for a schedule

18 © J. Castrillon. Get-together for IoT, Apr. 2016

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Time

PE2

PE1 a4

a3

a1

a2

a4

a3

a1

a2

a4

a3

a1

a2

Modeling processors

q Abstract models with functional units, latencies

19 © J. Castrillon. Get-together for IoT, Apr. 2016

Execution counts, branch stats and execution traces

[Eusse14]

Modeling communication

q High-level: Available APIs for implementing application channels

q Benchmarking for different platforms

20 © J. Castrillon. Get-together for IoT, Apr. 2016

[Oden13]

hs-serial

hs-serial

hs-serial

hs-serial

hs-serial

parallelRouter(1,0)

FPGA-Interface

Router(1,1)

Router(0,1)

Router(0,0)

Duo-PE0

Duo-PE1

FEC

Duo-PE2

SDDuo-PE6

Duo-PE7

Duo-PE5

Duo-PE3 CM

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PM

GT

AVS Contr.

UART-GPIO

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PMGT

ADPLL, PMGT

ADPLL, PMGT

ADPLL

ADPLL

DDR-SDRAM-Interface

Tomahawk2_core

APP

Duo-PE4

ADPLL, PMGT

VDSPRISC

VDSPRISC

VDSPRISC

VDSPRISC

VDSPRISC

VDSPRISC

VDSPRISC

ADPLL, PMGT

VDSPRISC

[Arnold13]

Modeling communication

q High-level: Available APIs for implementing application channels

q Benchmarking for different platforms

21 © J. Castrillon. Get-together for IoT, Apr. 2016

[Oden13]

hs-serial

hs-serial

hs-serial

hs-serial

hs-serial

parallelRouter(1,0)

FPGA-Interface

Router(1,1)

Router(0,1)

Router(0,0)

Duo-PE0

Duo-PE1

FEC

Duo-PE2

SDDuo-PE6

Duo-PE7

Duo-PE5

Duo-PE3 CM

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PM

GT

AVS Contr.

UART-GPIO

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PM

GT

ADPLL, PMGT

ADPLL, PMGT

ADPLL, PMGT

ADPLL

ADPLL

DDR-SDRAM-Interface

Tomahawk2_core

APP

Duo-PE4

ADPLL, PMGT

VDSPRISC

VDSPRISC

VDSPRISC

VDSPRISC

VDSPRISC

VDSPRISC

VDSPRISC

ADPLL, PMGT

VDSPRISC

[Arnold13]

0 1 024 2 048 3 072 4 096 5 120 6 144 7 168 8 1920

500

1 000

1 500

2 000

© J. Castrillon. Get-together for IoT, Apr. 201622

Analysis and synthesis: Overview

Non-functional specification

Applicationspecification

Architecture model

Analysis: Instrumentation, profiling, tracing

Sequential performance estimation

Mapping and scheduling

Mapping configuration

Time-annotated traces

Parallel perf. estimation

Increase resources

For dynamic models!

Tracing: Dealing with dynamic behavior

q KPNs do not have firing semanticsq White model of processes: source code analysis and tracingq Tracing: instrumentation, token logging and event recording

© J. Castrillon. Get-together for IoT, Apr. 201623

TracingSeq. perf. estimation

MappingPar. perf. estimation

Resources

Elapsed time from processor models

Trace-based synthesis

q Synthesis based on code and trace analysis (using simple heuristics)q Mapping of processes and channelsq Scheduling policies

q Buffer sizing24 © J. Castrillon. Get-together for IoT, Apr. 2016

[Castrill10, Castrill10b, Castrill13]

TracingSeq. perf. estimation

MappingPar. perf. estimation

Resources

t

t

t

Time-annotated traces

Mapping, scheduling,

buffer sizing

Non-functional specification

Mapping configuration

Architecture model

TracingSeq. perf. estimation

MappingPar. perf. estimation

Resources

Mapping, scheduling and buffer sizing

q Graph representation of event tracesq Reason about deadlocksq Find critical path

q Downside: For dynamic portions of the application, graph is input dependent

25 © J. Castrillon. Get-together for IoT, Apr. 2016

...

...[Castrill12]

Multiple traces (ongoing work)

q Different input à different behavior (traces)q Characterize behaviors and impact on mapping performance

© J. Castrillon. Get-together for IoT, Apr. 201626

[Goens15]

TracingSeq. perf. estimation

MappingPar. perf. estimation

Resources

...

...

...

Mapping configuration

Mapping configuration

Mapping configuration

Multiple traces (2)

q Different input à different behavior (traces)q Characterize behaviors and impact on mapping performance

27 © J. Castrillon. Get-together for IoT, Apr. 2016

...

...

...

TracingSeq. perf. estimation

MappingPar. perf. estimation

Resources

Behavior

Perf

orm

ance

[Goens15]

Mapping configuration

Mapping configuration

Mapping configuration

Parallel performance estimation

q Discrete event simulator to evaluate a solutionq Replay traces according to mappingq Extract costs from architecture file (NoC modeling, context switches, communication)

© J. Castrillon. Get-together for IoT, Apr. 201628

TracingSeq. perf. estimation

MappingPar. perf. estimation

Resources

Mapping configuration

t

t

t

Time-annotated traces

Trace Replay Module(TRM)

Architecture model

Grantt Chart, platform utilization, channel profiles, ...

t

Contextswitch

Blockedtime

...

Increasing resources

q Add resources to the synthesis until constraints are met

q Add processors and memoriesq Easy for homogeneous platforms

q Non trivial for heterogeneous platforms (ongoing work)

© J. Castrillon. Get-together for IoT, Apr. 201629

TracingSeq. perf. estimation

MappingPar. perf. estimation

ResourcesMapping and scheduling

Parallel perf. estimation

Increase resources

Multiple mapping configurations

q In case multiple applications may run on the system (or system of systems)q Need different configurationsq Adapt to resource and energy budgets

q Reason about mapping equivalences

30 © J. Castrillon. Get-together for IoT, Apr. 2016

=?

[Goens15]

Analysis for Multi-applications

q Use utilization profiles from the discrete event simulator (TRM)q Quickly separate good from bad implementations

31 © J. Castrillon. Get-together for IoT, Apr. 2016

[Castrill14]

Code generation

q Take mapping configuration and generate code accordinglyq From architecture model: APIs, configuration parameters, …q Sample targets: experimental heterogenous systems, TI (Keysonte,

TDA3x), Parallela/Ephiphany, ARM-based (Exynos, Snapdragon)

32 © J. Castrillon. Get-together for IoT, Apr. 2016

PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_right;PNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_right;PNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_left;PNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_right;PNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_src;taskParams.priority = 1;...

Code generationCPN

application

Mapping configuration Architecture

model Debugging info[Castrill11b, Murillo14]

Examples

q Same application(s) automatically compiled to different platforms

q Virtual platforms: SystemC models of full systemsq Bare metalq Multi-RISC/VLIW platform

q Real platform: TI Keystone platformq Speedup on commercial platformsq Code generation against vendor stacks

33 © J. Castrillon. Get-together for IoT, Apr. 2016

Example: multi-media applications

q Iterative mapping

© J. Castrillon. Get-together for IoT, Apr. 201634

© J. Castrillon. Get-together for IoT, Apr. 201635

Manual vs. Automatic: TI Keystone

Image processing Audio filtering application

LTE digital receiver

[Aguilar14]

Outline

q Introduction

q Models of computation

q Tool flows

q Outlook for IoT

q Summary

© J. Castrillon. Get-together for IoT, Apr. 201636

© J. Castrillon. Get-together for IoT, Apr. 201637

So far…

MoC-based Application

Non-functional specification

PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_rightPNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_right;PNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_left;PNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_right;PNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_src;taskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_fft_l;taskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_ifft_r;taskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_sink;taskParams.priority = 1;

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2Architecture

model

Analysis

Synthesis

Code generation

What’s missing?

q Reactivityq Supported for periodic events (trigger)q Enough for many applications (automotive)

q Sporadic (I/O) events may break determinism

q Better execution guarantees

q Multi-SoCq Supported if inter-SoC communication is

predictable (same cost-function approach)

q Scalability: Add a layer of “geographic” coarse mapping

38 © J. Castrillon. Get-together for IoT, Apr. 2016

1 ms

3 ms

1 ms

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Outline

q Introduction

q Models of computation

q Tool flows

q Outlook for IoT

q Summary

© J. Castrillon. Get-together for IoT, Apr. 201640

Summary

q IoT complexity: heterogeneity, resource constraints, distributed, …q The need for formal MoC for programming and synthesisq Sample tool flow

q Heterogeneity: High-level architecture models

q Support dynamic applications: Analysis and synthesis based on tracesq Extensions for: multiple traces

q Outlookq Adaptivity at runtime

q Reactivity and multi-SoC case studies

© J. Castrillon. Get-together for IoT, Apr. 201641

Acknowledgements

q Silexica Software Solutions GmbH

q German Cluster of Excellence: Center for Advancing Electronics Dresden (www.cfaed.tu-dresden.de)

© J. Castrillon. Get-together for IoT, Apr. 201642

References[Castrill14] J. Castrillon , et al., “Programming Heterogeneous MPSoCs:Tool Flows to Close the Software Productivity Gap”, Springer, 2014

[Castrill11a] J. Castrillon, et al., “Trends in embedded softwaresynthesis”, International Conference on Embedded Computer Systems(SAMOS) pp. 347-354, 2011

[Oden13] M. Odendahl, et al., “Split-cost communication model forimproved MPSoC application mapping”, In International Symposium onSystem on Chip pp. 1-8, 2013

[Arnold13] O. Arnold, et al. “Tomahawk - Parallelism andHeterogeneity in Communications Signal Processing MPSoCs”. TECS,2013

[Eusse14] J.F. Eusse, , et al., "Pre-architectural performance estimationfor ASIP design based on abstract processor models," In SAMOS 2014pp.133-140, 2014

[Castrill10] J. Castrillon, , et al., “Component-based waveformdevelopment: The nucleus tool flow for efficient and portable SDR,”Wireless Innovation Conference and Product Exposition (SDR), 2010

[Castrill10b] J. Castrillon, et al., “Trace-based KPN composabilityanalysis for mapping simultaneous applications to MPSoC platforms”, InDATE 2010, pp. 753-758

[Castrill13] J. Castrillon, et al., “MAPS: Mapping concurrent dataflowapplications to heterogeneous MPSoCs,” IEEE Transactions on IndustrialInformatics, vol. 9, 2013

[Castrill12] J. Castrillon, et al. , “Communication-aware mapping ofKPN applications onto heterogeneous MPSoCs,” in DAC 2012

[Goens15] A. Goens, et al., “Analysis of Process Traces for MappingDynamic KPN Applications to MPSoCs”, In IFIP International EmbeddedSystems Symposium (IESS), 2015, Foz do Iguaçu, Brazil (Accepted forpublication), 2015

[Castrill11b] J. Castrillon, et al., “Backend for virtual platforms withhardware scheduler in the MAPS framework”, In LASCAS 2011 pp. 1-4,2011

[Murillo14] L. G. Murillo, et al., “Automatic detection of concurrencybugs through event ordering constraints”, In DATE 2014, pp. 1-6, 2014

[Aguilar14] M. Aguilar, et al., "Improving performance andproductivity for software development on TI Multicore DSP platforms,"in Education and Research Conference (EDERC), 2014 6th EuropeanEmbedded Design in , vol., no., pp.31-35, 11-12 Sept. 2014

© J. Castrillon. Get-together for IoT, Apr. 201643