IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. X, XXXXXXX 2013

Novel Techniques for Smart Adaptive Multiprocessor SoCs

    Luciano Ost, Member, IEEE, Rafael Garibotti, Gilles Sassatelli, Member, IEEE,

    Gabriel Marchesan Almeida, Rémi Busseuil, Member, IEEE, Anastasiia Butko,

    Michel Robert, and Jürgen Becker, Senior Member, IEEE

Abstract—The growing concerns of power efficiency, silicon reliability, and performance scalability motivate research in the area of adaptive embedded systems, i.e., systems endowed with decisional capacity, capable of online decision making so as to meet certain performance criteria. The scope of possible adaptation strategies is subject to the targeted architecture specifics, and may range from simple scenario-driven frequency/voltage scaling to rather complex heuristic-driven algorithm selection. This paper advocates the design of distributed-memory homogeneous multiprocessor systems as a suitable template for best exploiting adaptation features, thereby tackling the aforementioned challenges. The proposed solution lies in the combined use of a typical application processor for global orchestration along with such an adaptive multiprocessor core for the handling of data-intensive computation. This paper describes an exploratory homogeneous multiprocessor template designed from the ground up for scalability and adaptation. The proposed contributions aim at increasing architecture efficiency through smart distributed control of architectural parameters such as frequency, and enhanced techniques for load balancing such as task migration and dynamic multithreading.

    Index Terms—Adaptive systems, multithreading, adaptive hardware and systems, NoC-based multiprocessor


    1 INTRODUCTION

DRIVEN by the rapidly increasing nonrecurring engineering (NRE) and fixed manufacturing costs, the semiconductor industry shows signs of fragmentation driven by specialization. This leads to the emergence of platform-based design, in which off-the-shelf systems-on-chips (SoCs) are integrated in a multitude of branded products. Popular platforms such as TI OMAP and Qualcomm Snapdragon are prominent examples in the domain of consumer-market smartphones and tablets. These SoCs, typically referred to as multiprocessor SoCs (MPSoCs), are built around an application processor (AP) running a microkernel, connected to a number of application-specific ISP cores such as DSPs and ASIPs, each of which is devoted to a given functionality such as baseband processing, video decoding, and so on. Such a heterogeneous system architecture is motivated by concerns of cost and energy efficiency. Because of the heterogeneous nature of such systems, interprocessor communications take place by means of exchanged messages, mostly through an external shared memory, and are often orchestrated by the AP operating system. Such a system template poses severe scalability problems that may be summarized as follows:

1. Reliability and ageing of silicon in below-22-nm nodes motivate research in the area of resilient architecture design [1]. Most explored solutions rely on the use of spare/redundant processing resources, for either instant fault detection/correction (critical systems) or remapping of functionality on fault detection. Being highly heterogeneous, the typical MPSoC template hardly enables devising such techniques.

2. As performance and power efficiency scaling cannot solely rely on technology downscaling, online optimization is of utmost importance for better exploiting available resources, thereby avoiding design oversizing. Though typical embedded MPSoCs employ advanced techniques such as semidistributed power gating and voltage and frequency scaling, the scope of foreseeable techniques remains limited, mostly because of the heterogeneous and centralized nature of such systems.

3. The resulting system complexity induces a number of difficulties from a software point of view, with typically hundreds of IPs for which drivers are either closed-source or provide bare-minimum support for the default purpose of any given IP. Because of this, the system often remains underexploited: user applications are intended to be executed on the AP, not on the application-specific ISPs, which are not seen as programmable from an end-user/application-developer perspective.1

This paper therefore proposes an alternative system organization that offers online optimization opportunities

. L. Ost, R. Garibotti, G. Sassatelli, R. Busseuil, A. Butko, and M. Robert are with the Laboratory of Informatics, Robotics and Microelectronics of Montpellier (LIRMM), UMR 5506-CC 477, 161 rue Ada, 34095 Montpellier Cedex 5, France. E-mail: {ost, garibotti, sassatelli, busseuil, butko, robert}@lirmm.fr.

. G.M. Almeida and J. Becker are with the Karlsruhe Institute of Technology (KIT), Engesserstr. 5 Geb. 30.10, ITIV, KIT, 76131 Karlsruhe, Germany. E-mail: {gabriel.almeida, juergen.becker}@kit.edu.

Manuscript received 21 Aug. 2012; revised 16 Jan. 2013; accepted 25 Feb. 2013; published online 14 Mar. 2013. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCSI-2012-08-0579. Digital Object Identifier no. 10.1109/TC.2013.57.

1. GPUs are a notable exception, with the growing interest in General-Purpose computing on GPUs (GPGPU) using APIs such as OpenCL. As these architectures provide significant speedups for very regular code only, they are not discussed further here.

0018-9340/13/$31.00 © 2013 IEEE Published by the IEEE Computer Society

beyond classical techniques, thereby furthering potential benefits. We regard the heterogeneous and centralized nature of MPSoCs as the main limiting factors toward achieving better scalability. Though the use of an AP is not in question, we argue that a subset of domain-specific ISPs can be advantageously replaced by a fully decentralized homogeneous multiprocessor core designed for scalability and adaptation, similar to the major shift observed in GPU architectures, which have become homogeneous for achieving a better utilization of processing resources.

The multiprocessor core used in this study, previously proposed by Busseuil et al. [2], is available under the GNU General Public License at [3]. This core is based on an array of loosely coupled independent processors that communicate through exchanged messages routed in a 2D-mesh NoC. Each processor is equipped with a local memory that stores both application code and a compact microkernel that provides a number of services, among which online optimization functionalities, from now on referred to as adaptation,2 which are either executed in a native distributed fashion or in a semicentralized mode:

. First, based on the proposed architecture template, we demonstrate the benefits of local distributed control through an implementation of feedback control loops for adjusting frequency at processor level.

. Second, we propose a novel approach to load balancing named remote execution (RE), which offers a tradeoff between reactivity and performance compared to task migration.

. Third, though message passing is the default programming model for such a distributed-memory architecture, we describe a distributed shared-memory (DSM) approach that enables the architecture to operate in multithreading mode using a thread API similar to POSIX threads. This approach allows for exploiting load balancing at intratask level, thereby giving another degree of freedom in the adaptation process, and further paves the way for semi-general-purpose application acceleration.

The rest of this paper is organized as follows: Section 2 presents and discusses relevant approaches in the literature, and puts these in perspective with the current work. Section 3 then describes the multiprocessor core hardware and software architecture as well as the programming models. Section 4 is devoted to the first two contributions, the use of feedback control loops for distributed control and RE. Section 5 presents the proposed multithreading approach and provides benchmarking results for scalability assessment. Section 6 is a wrap-up of all proposed techniques, which are compared from an adaptation perspective on a case study. Section 7 draws conclusions and proposes future work directions.

    2 STATE OF THE ART

Adaptation refers to online optimization mechanisms that may take place at several abstraction levels: 1) architecture (low level); and 2) system (high level). In the first case, because adaptation is dealt with by the hardware itself, quasi-instant decisions may be made at the price of a somewhat limited capability of analysis. Typically, decisions are made on threshold crossing and often relate to tuning of architecture parameters such as voltage or frequency. Phenomena that take place at a larger time scale, with therefore slower dynamics, are handled at system level, often by the software. In this latter case, finer analysis is foreseeable (i.e., statistical information gathering), and possible adaptation strategies cover techniques such as task migration, route reconfigurations, computation/communication rescheduling, and so on.

    2.1 Adaptation at Architecture Level

Dynamic Voltage and Frequency Scaling (DVFS) techniques have long been used in portable embedded systems to reduce the energy consumption and temperature of such systems. Most existing DVFS schemes are defined at design time, where they are typically based on predefined profiling analysis (e.g., application data) that attempts to define an optimal DVFS configuration [4], [5]. Puschini et al. [6] propose a scalable multiobjective approach based on game theory, which adjusts at runtime the frequency of each processing element (PE) while reducing tile temperature and maintaining the synchronization between application tasks. Due to the dynamic variations in the workload of these systems and the impact on energy consumption, other adaptation techniques such as Proportional-Integral-Derivative (PID)-based control have been used to dynamically scale the voltage and frequency of processors [7], [8], and recently, of networks-on-chip (NoC) [9], [10]. These techniques differ in terms of monitored parameters (e.g., task deadline, temperature) and response times (the period necessary to stabilize a new voltage/frequency).

Ogras et al. [9] propose an adaptive control technique for a multiple-clock-domain NoC, which considers the dynamic workload variation aiming at decreasing the power consumption. Besides, the proposed control technique ensures the operation speed and the frequency limits of each island (clock domain, defined according to a given temperature). The effectiveness of the proposed adaptive control technique was evaluated considering different scenarios (e.g., comparison with a PID controller) for an MPEG-2 encoder. In [10], a PID controller is used to provide a predetermined throughput for multiple voltage-frequency/voltage-clock NoC architectures. The proposed PID-based controller sets the voltage and frequency of each NoC island by reserving virtual channel weights (primary control parameter) to provide the necessary throughput for different application communications, while saving energy.

In [11], the authors show a technique for minimizing the total power consumption of a chip multiprocessor while maintaining a target average throughput. The proposed solution relies on a hierarchical framework, which employs core consolidation and coarse-grain DVFS. The problem of optimally assigning dependent tasks in a multiprocessor system has not been addressed in their paper. Adaptability has been explored under a number of aspects in embedded systems, ranging from adaptive modulation used in the Third Generation Partnership Project Long Term Evolution (3GPP-LTE) standard Software Defined Radio (SDR) [12] to adaptive core instantiation in dynamically reconfigurable FPGAs [13].


2. The concept of adaptation (sometimes referred to as self-adaptation in the literature) used throughout this paper is similar to that used in biology: it refers to the process of adapting to time-changing situations, driven by the entity that undergoes the process itself.

2.2 Adaptation at System Level

Adaptation at system level, here understood as a means for dynamically assigning tasks to nodes so as to optimize system performance, can be done in two ways: 1) by using dynamic task mapping heuristics; and 2) by task migration (TM) for enabling load balancing. Dynamic mapping heuristics are intended to efficiently map, at runtime, application tasks from a repository to the available processors in the architecture. Several dynamic mapping heuristics have been proposed [14], [15], [16], [17]. Such heuristics aim to satisfy QoS requirements (e.g., [15]), to optimize resource usage (e.g., network contention [18], [16]), and to minimize energy consumption (e.g., [18], [19], [17]). Each of those has its own parameters and cost functions, which can have numerous variations (defined as new heuristics with different properties and strengths). Thus, choosing the optimal dynamic heuristic for a set of applications is not trivial; hence, evaluating different performance metrics is time consuming. Alternatively, TM may employ mapping heuristics that operate on monitoring information (e.g., processor workload, communication latency) to better fit the current execution scenario. Streichert et al. [20] employ TM to ensure computation correctness of an application by migrating executing tasks from a faulty node to a healthy one. The authors propose an analytical methodology that uses binding to reduce the overhead of the task migration. However, neither architecture details nor migration mechanisms are presented in this work. A similar analytical approach is presented in [21], which proposes five task migration heuristics. These heuristics are employed to remap tasks onto the remaining nonfaulty nodes after the identification and isolation of a given fault.
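To make the flavor of such heuristics concrete, the sketch below shows a minimal greedy mapping policy in Python. The cost function (least load first, then hop distance to the task's main communication partner), the node names, and the load values are illustrative assumptions, not any of the published heuristics [14], [15], [16], [17].

```python
def map_task(task, loads, positions):
    """Greedy dynamic-mapping sketch: pick the least-loaded node,
    breaking ties by Manhattan (hop) distance to the task's main
    communication partner."""
    px, py = positions[task["partner"]]
    def cost(n):
        x, y = positions[n]
        return (loads[n], abs(x - px) + abs(y - py))  # load first, then hops
    return min(loads, key=cost)

# 2x2 mesh: per-node load estimates and grid coordinates (all invented)
loads = {"n00": 3, "n01": 2, "n10": 1, "n11": 5}
positions = {"n00": (0, 0), "n01": (0, 1), "n10": (1, 0), "n11": (1, 1)}
task = {"partner": "n00"}       # the new task talks mostly to n00
print(map_task(task, loads, positions))  # n10: lightest load, one hop away
```

Real heuristics would monitor many more parameters (communication latency, energy, contention), but they share this shape: a per-node cost function evaluated at runtime against current monitoring data.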

Shen and Petrot [22] discuss the software and hardware architecture features that are necessary to support TM in heterogeneous MPSoCs. In [23], the authors propose a policy that exploits runtime temperature as well as workload information of streaming applications to define suitable runtime thermal migration patterns. In [24], the authors present a migration case study for MPSoCs that relies on the μClinux operating system and a checkpointing mechanism, which defines possible migration points in the application code. This work was extended in [25], where an OS and middleware infrastructure for task migration on their MPSoC platform is described.

Barcelos et al. [26] explore the use of TM in a NoC-based system (high-level model) that comprises both shared and distributed memory organizations, aiming to decrease the energy consumption when transferring the task code. The distance between the nodes involved in the code migration is used to define, at runtime, the memory organization that will be used. In sum, this section presented related work on adaptive techniques at both architecture and system level: 1) adaptation at architecture level (DVFS); 2) adaptation at system level through a) dynamic task mapping heuristics and b) task migration in both shared-memory and distributed-memory architectures. Our approach differs from existing work in the following aspects:

. Several contributions have been presented at these two levels, but none of them proposes an adaptive multiprocessor combining all three features. Our approach employs task mapping mechanisms at two levels: dynamic mapping heuristics provide a better initial state to the system, and once tasks are mapped onto the platform, a task migration mechanism enables achieving load balancing among the different PEs in the architecture.

. To the best of our knowledge, the used embedded multiprocessor core is the only distributed-memory system, with RTL implementation, that supports task migration without any help from shared memory, which makes the template scalable.

This paper puts the focus on actuation mechanisms rather than decision-making algorithms and extends previous work in three key directions: 1) a novel purely distributed adaptation technique based on PID controllers is proposed and benchmarked in terms of performance and power efficiency; 2) novel load-balancing schemes that feature faster reactivity for load balancing compared to task migration are proposed and compared with the previous work; and 3) finally, a DSM implementation that allows for finer grain load balancing is described and benchmarked.

3 PROPOSED ARCHITECTURE TEMPLATE

This section describes the architecture of OpenScale [2], i.e., an open-source RTL multiprocessor core designed for scalability and adaptation. To enable scaling this architecture to any arbitrary size, it makes use of distributed memories so as to avoid the usual shared-memory bottlenecks. Advanced adaptation features such as task migration are enabled thanks to a custom microkernel capable of handling complex procedures such as code migrations across distributed physical memories, and so on.

3.1 Hardware Architecture

The adopted architecture is a homogeneous message-passing NoC-based multiprocessor core with distributed memory. Each node comprises a 32-bit pipelined CPU with configurable instruction and data caches. This core implements the MicroBlaze ISA and features a timer, an interrupt controller, a RAM, and optionally a UART [27]. These components are interconnected via a Wishbone v4 open-source bus. The communication between the node and the NoC router is implemented in the network interface (NI), which performs low-level packetization/depacketization.

The adopted NoC relies on a mesh of tiny routers based on the HERMES infrastructure [28]. The NoC employs packet switching of wormhole type: the incoming and outgoing ports used to route a packet are locked during the entire packet's transfer. The routing algorithm is of XY type (Hamiltonian routing), which allows deterministic routing. Each router has one incoming FIFO buffer per port. The depth of the FIFOs can be chosen according to the desired communication performance. Fig. 1 depicts the platform along with the two supported programming models that are described in Section 3.3, message passing and multithreading through runtime virtual Symmetric Multi-Processor (vSMP) cluster definition.
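As a minimal illustration of the deterministic XY policy, the Python sketch below routes a packet by first aligning the X coordinate, then the Y coordinate; the grid coordinates are the only assumption.

```python
def xy_route(src, dst):
    """Deterministic XY routing sketch: move along X until the
    destination column is reached, then along Y. Returns the
    sequence of router coordinates traversed after the source."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                 # X dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                 # then Y dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

print(xy_route((0, 0), (2, 1)))    # [(1, 0), (2, 0), (2, 1)]
```

Because the path depends only on source and destination, XY routing is deadlock-free on a mesh and needs no adaptive state in the routers, at the cost of ignoring congestion.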


3.2 Software Architecture

The microkernel used in our platform was originally developed by Rhoads [29] and extensively modified to implement and provide the new services needed for proposing new adaptive mechanisms for homogeneous MPSoC architectures. This choice was mostly motivated by the small memory footprint (around 150 KB) of this microkernel given the provided features.

Applications are modeled as tasks that access hardware resources through an API. A function call then generates a software exception, handled by the operating system Exception Manager. Additionally, the Scheduler communicates with the memory and task management modules, and exchanges messages with the decision-making mechanisms. The Routing Table Manager keeps track of tasks' locations, storing the information in a data structure called the routing table. Whenever a task has to communicate with another, the operating system performs a lookup in the local routing table. Besides, Adaptation Mechanisms are implemented to provide load balancing as well as to optimize platform resource utilization, namely by scaling processor frequency according to application requirements.

One of the objectives of this work is to enable dynamic load balancing, which implies the capability to migrate running tasks from processor to processor. Migrating tasks usually implies: 1) dynamically loading the corresponding code in memory and scheduling a new process; 2) restoring the context of the task that has been migrated. A possible approach relies on resolving all references of a given code at load time; such a feature is partly supported in the Executable and Linkable Format (ELF) [30], which lists the dynamic symbols of the code and enables the operating system loader to link the code for the newly decided memory location. Such mechanisms heavily rely on the presence of a hardware Memory Management Unit (MMU) for address virtualization and are memory consuming, which clearly puts this solution out of the scope of the approach.

Another solution for enabling the loading of processes without such mechanisms relies on a feature partly supported by the GCC compiler that enables it to emit relocatable code (PIC—Position-Independent Code). This feature, generally used for shared libraries, generates only relative jumps and accesses data locations and functions using a Global Offset Table (GOT) that is embedded into the generated ELF file. A specific postprocessing tool, which operates on this format, was used for reconstructing a completely relocatable executable. Experiments performed in [31] show that both memory and performance overheads remain under 5 percent for this solution, which is clearly acceptable.

    3.3 Programming Model

    3.3.1 Message Passing Interface (MPI)

Programming takes place using a message-passing API. Hence, tasks are hosted on nodes, which provide, through their operating system, communication primitives that enable data transfers between communicating tasks. The proposed model uses two P2P communication primitives, MPI_Send() and MPI_Receive(), based on synchronous MPI. MPI_Receive() blocks the task until the data are available, while MPI_Send() is nonblocking, unless no buffer space is available.

Each communicating task may have one or multiple software FIFOs. In case a FIFO runs full, the operating system resizes it according to the available memory. This process is handled by a flow control mechanism implemented as a service in the OS. This strategy allows saving resources, since memory is dynamically allocated/deallocated according to communication needs.
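The resizing behavior can be sketched as follows. The initial capacity and the doubling policy are illustrative assumptions; the actual kernel resizes according to the memory currently available on the node.

```python
from collections import deque

class SoftFifo:
    """Software FIFO that grows when full, mimicking the OS flow-control
    service described above (capacity and growth policy are invented)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.q = deque()

    def push(self, msg):
        if len(self.q) == self.capacity:   # FIFO runs full:
            self.capacity *= 2             # the OS resizes it
        self.q.append(msg)

    def pop(self):
        # None stands in for MPI_Receive() blocking until data arrive
        return self.q.popleft() if self.q else None

fifo = SoftFifo(capacity=2)
for i in range(5):          # sender temporarily outpaces the receiver
    fifo.push(i)
print(fifo.capacity)        # 8: grew twice from the initial 2
print(fifo.pop())           # 0: messages are received in FIFO order
```

In the real platform a shrink path would also exist, so that memory is handed back once the communication burst is over.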

    3.3.2 Shared Memory/Multithreading

The work presented above relies on the assumption of the economic viability of numerous on-chip distributed memories. While this may prove doable in the near future because of advances in the area of dense nonvolatile emerging-technology memories [32] such as Magnetic RAMs (MRAMs) and Phase-Change Memories (PCMs), devising an approach that makes it possible to share memory would further provide the distinct advantage of opening the way for multithreading, which is by far the most popular parallel programming model in the area of general-purpose computing. We here present a low-overhead hardware/software DSM approach that makes such a distributed-memory architecture multithreading capable. The proposed solution has been implemented into the multiprocessor core through the development of a POSIX-like thread API. This approach efficiently draws strength from the on-chip distributed-memory nature of the original architecture template, which therefore retains its intrinsic power efficiency, and further opens the way to exposing the multithreading capabilities of that component as a general-purpose accelerator.
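A minimal sketch of what programming against such a POSIX-like API could look like, with Python threads standing in for DSM-backed hardware threads; the names thread_create/thread_join are hypothetical, not the actual OpenScale API.

```python
import threading

def thread_create(fn, arg):
    """pthread_create-style wrapper (illustrative API only)."""
    t = threading.Thread(target=fn, args=(arg,))
    t.start()
    return t

def thread_join(t):
    """pthread_join-style wrapper."""
    t.join()

shared = [0] * 4           # stands in for a DSM region kept coherent
lock = threading.Lock()    # coherence/synchronization stand-in

def worker(i):
    with lock:
        shared[i] = i * i  # each thread updates its own slice

threads = [thread_create(worker, i) for i in range(4)]
for t in threads:
    thread_join(t)
print(shared)              # [0, 1, 4, 9]
```

The point of the DSM layer is precisely that code of this shape, written against shared variables, runs unchanged on the distributed-memory nodes.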

    3.4 Benchmarking Application Kernels

Four application kernels were employed for benchmarking both the adaptation techniques and the multithreading extensions. These application kernels are: motion JPEG (MJPEG) video decoder, Smith-Waterman (used to find similar regions between two DNA sequences [33]), LU (matrix factorization), and Fast Fourier Transform (FFT). For each of these compute kernels, both a message-passing (making use of the above-mentioned primitives) and a multithreaded version were developed. Contrary to the MPI implementation, in which communications are explicit, Pthread-based implementations rely on threads concurrently accessing shared variables residing in a shared memory assumed coherent. For instance, MJPEG threads process entire images, whereas the Smith-Waterman implementation relies on worker threads that run the sequence alignment algorithm on different data sets. Similarly, LU factorizes a matrix as the product of a lower and an upper triangular matrix, and each thread processes different data sets. In turn, FFT is made of a number of threads, each of which performs a number of butterfly operations for minimizing data transfers.


    Fig. 1. OpenScale platform.

Table 1 gives benchmark-specific figures expressed in terms of single-thread processing time, computation/data ratio, and code size. Due to the different natures of those applications, execution time varies greatly and ranges from less than 2 million clock cycles (4 ms at 500 MHz) to over 13 million clock cycles (26 ms at 500 MHz). The second important factor is the computation-to-data ratio. This number is the ratio of the number of actual compute instructions (e.g., add, mul, jump, etc.) over the number of memory instructions (load/store) executed by a thread. This figure gives an overview of the importance of instruction loading with respect to data loading. Hence, an application having a high computation-to-data-loading ratio would be compute dominated, therefore with proportionally fewer data accesses. Note that Smith-Waterman has the highest computation-to-data ratio compared to the others.
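The ratio just defined can be computed from an instruction-count profile as sketched below; the counts are invented for illustration and are not the Table 1 measurements.

```python
def compute_to_data_ratio(counts):
    """Ratio of compute instructions (add, mul, jump, ...) to memory
    instructions (load/store) executed by a thread."""
    mem = counts["load"] + counts["store"]
    comp = sum(v for k, v in counts.items() if k not in ("load", "store"))
    return comp / mem

# Hypothetical per-thread instruction counts for one kernel
profile = {"add": 600, "mul": 200, "jump": 200, "load": 300, "store": 100}
print(compute_to_data_ratio(profile))  # 2.5: compute dominated
```

A ratio well above 1, as here, indicates a compute-dominated thread; a ratio near or below 1 indicates a memory-dominated one, for which cache behavior matters more.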

The fourth column in Table 1 shows the thread code size of each application. A thread with a large code size would potentially lead to more instruction cache misses, because it would likely not fit entirely inside the cache.

    4 NOVEL ADAPTIVE TECHNIQUES

Adaptation draws strength from the ability to handle unpredictable scenarios: given the large variety of possible use cases that typical MPSoC platforms must support and the resulting workload variability, offline approaches are no longer sufficient because they do not allow coping with time-changing workloads. This section presents three adaptive techniques implemented in the OpenScale platform, previously introduced in Section 3. We regard the system as having a number of steady states (i.e., task mappings); going from one such state to another occurs whenever a system-level adaptation technique (such as a task migration) takes place, i.e., a transient phase between two steady states. The two levels at which adaptation is implemented are

. architecture level: Adaptation at this level addresses optimization when the system operates in a steady state. We here assume soft real-time applications; therefore, a target performance level has to be met while minimizing power.

. system level: Adaptation at this level triggers remappings in the architecture, resulting in switching from one steady state to another. In that precise case, the system operates in best-effort mode, attempting to reach another steady state in the shortest possible time.

    4.1 Architecture Level Optimization

As previously mentioned, OpenScale runs a microkernel that may generate multiple perturbations due to the complexity of managing all the provided services, such as dynamic task loading, the message passing protocol, task migration, and frequency scaling, among others. This is made worse by the data-dependent nature of some algorithms, for which the workload may greatly vary depending on input data, as in video decoding applications. The distributed frequency scaling (DFS) mechanism must be efficient enough to cope with such scenarios so that soft real-time application requirements are satisfied. The next section presents a smart DFS strategy that makes it possible to meet real-time application requirements in the presence of perturbations and fluctuating workload.

4.1.1 Dynamic and Distributed Frequency Scaling

A PID controller is a generic control-loop feedback mechanism widely used in industrial control systems. It calculates an error value as the difference between a measured process variable and a desired set point. The controller attempts to minimize the error by adjusting the process control inputs. However, for best performance, the PID parameters must be tuned according to the nature of the process to be regulated. We here propose the use of PID controllers at task level, as pictured in Fig. 2: offline profiling permits extracting reference performance values (such as throughput, interarrival time, jitter, etc.) that are then used as set points for each application task PID controller, which then continuously sets the frequency accordingly. This approach aims at setting the lowest possible frequency that satisfies timing constraints, so that minimum energy is wasted in idle cycles in each processor. Since our platform is fully parameterizable, it is possible, by executing different experiments, to extract the most appropriate period for executing the DFS control. This functionality is implemented as a service in the microkernel and is triggered according to the chosen period by means of interrupts.

The proportional, integral, and derivative terms are summed to calculate the output of the PID controller. Defining u(t) as the controller output, the final form of the PID algorithm is


TABLE 1: Performance Information of Adopted Applications

    Fig. 2. Distributed PID controllers.

$$u(t) = MV(t) = K_p\,e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d\,\frac{d}{dt}e(t).$$

1. Proportional gain, Kp: Larger values typically mean a faster response because the larger the error, the larger the proportional term compensation.

2. Integral gain, Ki: Larger values imply steady-state errors are eliminated more quickly.

3. Derivative gain, Kd: Larger values decrease overshoot, but slow down transient response and may lead to instability due to signal noise amplification in the differentiation of the error.
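A discrete-time version of the controller above can be sketched as follows. This is an illustrative implementation of the textbook PID law, not the microkernel service itself; the class name, gain values, and time step are hypothetical.

```python
class PID:
    """Discrete-time PID controller (sketch). The error e(t) is the
    set point minus the measured value; the output would drive the
    processor frequency up or down."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # running sum approximating the Ki integral
        self.prev_error = 0.0    # previous error for the Kd derivative

    def update(self, set_point, measured):
        error = set_point - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)

# When measured throughput falls below the set point, the output is
# positive, calling for a frequency increase (gains are illustrative).
pid = PID(kp=0.5, ki=0.1, kd=0.05, dt=0.001)
delta = pid.update(set_point=500.0, measured=440.0)
print(delta > 0)  # True
```

In the platform described here, such an update would run periodically as a microkernel service, with set points taken from the offline profiling step.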

Each running application is composed of one or multiple tasks monitored by a system thread responsible for calculating application performance in a nonintrusive mode. This information is passed on to the PID controller, which scales frequency up or down whenever a deviation from the set point value is observed. The strategy consists in deciding controller parameters on a per-task basis. To this purpose, a simulation of the MPSoC system is executed to obtain the system step response, here defined as the obtained application throughput under frequency change. Based on this high-level model, a number of different controller configurations can be explored, each of which exhibits different features such as speed, overshoot, static error, and so on. In this scenario, the values of P, I, and D have been chosen aiming to increase the reactivity of the system. The system is modeled in Matlab and the different values for P, I, and D are set.

    4.1.2 Experimental Protocol and Results

The following scenario is based on an audio/video decoding application which includes an Adaptive Differential Pulse-Code Modulation (ADPCM) decoder, MJPEG, and a Finite Impulse Response (FIR) filter. The MJPEG video decoder application is composed of five tasks running on a processor, and a different application (a synthetic task implemented in such a way that it requires a considerable amount of processing power) is migrated to the processor, disturbing the MJPEG application performance. Three different perturbation scenarios (P1, P2, and P3) are created based on different perturbation factors, defined respectively as: 1) low, 2) medium, and 3) high, represented in Table 2. ATBP represents the Average Throughput Before Perturbation, while ATAP represents the Average Throughput After Perturbation.

The following scenario depicts the system response under perturbation condition (P3). Application performance is analyzed for two approaches: 1) no adaptation is done and processor frequency is constant, and 2) processor frequency is controlled by a PID. The perturbation is configured as follows: 1) perturbation start time: 300 ms; 2) perturbation period: 700 ms; 3) application performance impact factor: 58 percent.

Fig. 3 shows throughput evolution for both fixed-frequency and PI-driven adaptation. Without the PI controller, task throughput is reduced to approximately 440 kB/s during the perturbation period. Considering a soft real-time video decoding application, where performance constraints must be met so as to ensure a constant frame rate, this scenario could result in freezes/frame skips. When using a PI controller, it takes only 100 ms to restore the initial set point, which represents the minimal performance requirement of a given application. Shortly after the perturbation ends, a peak in task throughput is observed, due to the fact that the processor is running at high frequency to compensate for the perturbation up to that point. The PI controller then gradually lowers the frequency and throughput goes back down to the set point, which results in power savings.

Table 3 presents the power and energy consumption for three different perturbation scenarios. It can be observed that power and energy consumption are significantly reduced when using PI- and PID-based controllers. In S1, the energy consumption is reduced by 32 percent, while power consumption is reduced by 42 percent in S3. It is important to observe that the energy/power savings can be much higher in scenarios with longer perturbation periods and where the impact factor of perturbations over applications is higher. It is further important to note


TABLE 2: Different Perturbation Scenarios

    Fig. 3. Perturbation Representation—PI versus no PI.

TABLE 3: Node Power and Energy Consumption Representation for Three Different Scenarios Using 65-nm Technology

that such an approach, which promotes continuous and frequent frequency scaling, may not be optimal depending on the targeted platform, because of constraints on selectable voltage/frequency pairs, for instance. Being implemented as a service in the microkernel, easy tuning of the actuation frequency, frequency range, number of selectable frequencies, or even the use of hysteresis thresholds is possible to best fit the platform/application/technology specifics.

    4.2 System Level Optimizations

As mentioned previously, system-level adaptation techniques are seen as transient phases between steady states. Because they often incur a penalty on application performance, the platform runs these techniques in best-effort mode so as to resume normal application execution in the shortest possible time. Benchmarking results for the two techniques proposed in this section are presented and discussed in Section 6.

    4.2.1 Task Migration with Redirection (TMR)

Task migration (TM) mechanisms are implemented by means of tight hardware/software interaction that handles code migration from the source node local memory to the destination node local memory. The basic mechanism relies on interrupting data transmission between the migrating task and all of its predecessors, which buffer produced data until completion of the migration procedure: Handshaking messages are exchanged between the predecessor node(s), the source node, and the destination node. Once migration of code across the physical memories of the source and destination nodes is complete, data transmission is resumed starting with the buffered packets.
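The handshake just described can be sketched at a high level as the following sequence. This is an illustrative model only, not the platform's HW/SW implementation; all class, method, and field names are hypothetical.

```python
class Node:
    """Minimal stand-in for a processor node (hypothetical model)."""
    def __init__(self):
        self.memory = {}         # task name -> task code
        self.routing_table = {}  # task name -> destination node
        self.buffering = set()   # tasks whose output is being buffered

    def suspend_sending(self, task):
        self.buffering.add(task)       # start buffering produced data

    def flush_buffer(self, task):
        self.buffering.discard(task)   # resume with buffered packets

def migrate_task(task, source, destination, predecessors):
    """Basic TM sequence: buffer, move code, update routes, resume."""
    for pred in predecessors:
        pred.suspend_sending(task)                 # handshake: stop & buffer
    destination.memory[task] = source.memory.pop(task)  # move task code
    for pred in predecessors:
        pred.routing_table[task] = destination     # repoint the route
        pred.flush_buffer(task)                    # resume transmission
```

TMR differs from this baseline in where buffering happens (on the source node rather than the predecessors), which allows routing tables to be updated asynchronously.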

The migration time very much depends on the node processor workload, NoC traffic, and task size. However, the latency is typically on the order of 10 to 20 ms at a 500-MHz frequency. This, therefore, gives an assessment of the migration cost; overall, migrations shall not be triggered more than 10 times per second at that frequency, otherwise the performance penalty may become significant.

The TMR mechanism aims at speeding up migration and operates local buffering on the source node rather than the predecessor node. This allows for 1) asynchronous update of the predecessor node(s) routing table(s), resulting in data buffered on both the source node and destination node, and 2) immediate redirection of buffered packets right after the code has been transferred.

    4.2.2 Remote Execution

Different from the TM and TMR techniques, in the RE mechanism the task is remotely executed in the destination node, using a remote memory access (RMA) protocol implemented in the NoC. The RE protocol is illustrated in Fig. 4. First, the processor in node A interrupts t1 execution (step 1) and then sends an execution request with the current task state (step 2). Once the chosen node B grants the request, t1 is executed on the processor of node B by fetching its code from the remote processor in node A (step 3). To improve execution efficiency, instructions fetched remotely are cached in the L1 instruction cache of the processor in node B.

It is important to note that the programming philosophy of the platform here remains message passing and not shared memory. Hence, only few data (static data) are fetched from remote memory during an RE, with the main data flow coming from messages. In this configuration, data remain uncached during an RE. This choice was made considering that we have on-chip memory, with a memory access latency a few orders of magnitude lower than that of a standard shared-memory architecture using off-chip memory. Furthermore, using uncached data avoids the use of complex cache coherency mechanisms between data caches and remote memories. Instructions, however, are entirely cached.

To support this protocol, a hardware module called RMA was implemented. The RMA is composed of an RMA-Send and an RMA-Reply, which enable distant memory access through the NoC. Two asynchronous FIFOs are employed to guarantee the communication between both modules and the NoC, as shown in Fig. 4. The purpose of the RMA-Send is to handle memory requests from the local CPU (step 1 and step 3). In case of a write request, the RMA-Send component sends the data that must be stored in the task node (e.g., in Fig. 4, node B transfers the resulting data of the execution of t1 to node A). In case of a read request, the RMA-Send component requests the desired data from the remote node (e.g., in Fig. 4, node B requesting instruction code from node A). In turn, the RMA-Reply module answers the request coming from the RMA-Send (NoC side, step 2). In case the incoming request is granted, the data are stored in the memory. In a read request, the data are read from the memory, packetized, and sent to the requesting node.
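The RMA-Send/RMA-Reply split can be sketched functionally as below. This is a behavioral model for illustration only; the hardware uses NoC packets and asynchronous FIFOs, and all names and the request format here are hypothetical.

```python
class RMAReply:
    """Answers read/write requests against this node's local memory."""
    def __init__(self, memory):
        self.memory = memory  # dict: address -> word

    def handle(self, request):
        if request["op"] == "write":
            # Store the incoming data at the requested address.
            self.memory[request["addr"]] = request["data"]
            return {"status": "ack"}
        # Read: fetch the word, "packetize", and return it.
        return {"status": "ack", "data": self.memory[request["addr"]]}

class RMASend:
    """Turns local CPU memory requests into requests to a remote node."""
    def __init__(self, remote_reply):
        self.remote = remote_reply  # stands in for the NoC path

    def read(self, addr):
        return self.remote.handle({"op": "read", "addr": addr})["data"]

    def write(self, addr, data):
        self.remote.handle({"op": "write", "addr": addr, "data": data})

# Node B accessing node A's memory through the RMA pair:
node_a_memory = {}
send_b = RMASend(RMAReply(node_a_memory))
send_b.write(0x10, 42)
print(send_b.read(0x10))  # 42
```

In the real design, the two halves sit on opposite sides of the NoC; the direct call here simply replaces the packet transport.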

The overall memory access performance mainly depends on the cache-miss rate: as our architecture provides only an L1 cache, the miss rate determines the number of RMAs, and hence the traffic. Therefore, RMA performance depends on: 1) cache size, 2) cache policy, and 3) the number of RMAs resulting from the application.

    5 SIMULTANEOUS MULTITHREADING

OpenScale being natively a distributed memory system, message passing is the default programming model, and task migration the natural means for achieving load balancing. To


    Fig. 4. Remote task execution protocol.

both offer an alternative programming model and enhance adaptive load balancing, we developed a distributed shared memory (DSM) approach. In this context, the in-house microkernel described in Section 3.2 was modified to support a configurable processor memory mapping that permits specifying the ratio of local private data versus local shared data, made visible to all processors in the cluster as shown in Fig. 5.

Contrary to message passing, in which communications are explicit, shared-memory/thread-based parallelism relies on threads concurrently accessing shared variables residing in a shared memory assumed coherent. Synchronizations are handled by means of deciding a specific memory consistency model and an API such as the POSIX threads standard, which offers a range of primitives (mutexes, semaphores, fork/join, etc.). Implementing a coherent memory architecture proves challenging in case several physical memories are used: in such configurations, complex cache coherence mechanisms are required; machines of this type are usually referred to as cache-coherent nonuniform memory architectures (ccNUMA).

The proposed approach offers the opportunity to define shared-memory clusters of arbitrary size and geometry at runtime: Nodes belonging to a cluster go into bonding mode and share part of the host node memory (one per cluster). A POSIX-like thread API implementation along with DSM hardware support is presented in this section and benchmarked.

    5.1 Thread API and Memory Consistency Model

The used thread API is similar to POSIX and has functions belonging to three categories: thread creation, mutexes, and barriers. This implementation relies on a relaxed-consistency memory model in which memory is guaranteed coherent at synchronization points, i.e., whenever an API function call is made. To further lower the performance overhead related to memory consistency, the API requires shared data to be explicitly flagged as such, which accounts for the only difference in function prototypes with the POSIX thread API, all others being otherwise identical. A multithreaded task in our implementation uses the message passing API for communication with other tasks, but creates worker threads on remote processor nodes. During thread creation, data caches are flushed on the caller side and invalidated on the executer side. During a mutex lock, cache lines that possibly contain shared data that can be accessed between the locking and unlocking are invalidated. During a mutex unlock, those same lines are flushed. During barrier calls, the shared memory is also flushed and invalidated. This memory consistency model limits the cache coherence mechanism to the bare minimum, thereby simplifying software and hardware support, and features limited performance overhead: Invalidation and flush of a given cache line occur only if the cache line tag corresponds to the address specified by the instruction. This condition avoids unnecessary cache flushes/invalidations of cache lines containing unrelated data.
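The coherence actions tied to mutex operations can be sketched as follows. This is a behavioral illustration of invalidate-on-lock/flush-on-unlock, not the microkernel's code; the cache API and all names are hypothetical.

```python
class Cache:
    """Records coherence operations for inspection (hypothetical)."""
    def __init__(self):
        self.ops = []

    def invalidate(self, line):
        self.ops.append(("invalidate", line))  # discard stale copy

    def flush(self, line):
        self.ops.append(("flush", line))       # write back dirty copy

class DSMMutex:
    """Relaxed-consistency mutex: coherence happens only at the
    synchronization points, and only for flagged shared lines."""
    def __init__(self, cache, shared_lines):
        self.cache = cache
        self.shared_lines = shared_lines  # explicitly flagged shared data
        self.locked = False

    def lock(self):
        self.locked = True
        for line in self.shared_lines:
            self.cache.invalidate(line)   # observe fresh values after lock

    def unlock(self):
        for line in self.shared_lines:
            self.cache.flush(line)        # publish updates before release
        self.locked = False

cache = Cache()
mutex = DSMMutex(cache, shared_lines=[0x100])
mutex.lock()
mutex.unlock()
print(cache.ops)  # invalidate on lock, flush on unlock
```

Restricting the operations to flagged lines mirrors the tag-matching condition above: lines holding unrelated data are never touched.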

    5.2 Multithreading Hardware Support

The RMA has been purposely modified compared to that used in RE, so as to better support specific memory consistency operations. It is composed of an RMA-Send and an RMA-Reply, which enable distant memory access through the NoC. Two asynchronous FIFOs are employed to guarantee the communication between both modules and the NoC.

The RMA module contains a cache-miss handling protocol that comprises three main steps: 1) whenever a cache miss occurs, the CPU issues a cache line request routed through the NoC; 2) the host RMA reads the desired cache line (i.e., instruction/data), which is 3) sent back to the remote CPU, which resumes the execution of the thread as soon as the cache line is received. This protocol has a latency of 182 clock cycles at zero NoC load, from the cache line request to the remote thread execution, as illustrated in Fig. 6.

The maximum end-to-end bandwidth is 90 MB/s at 500 MHz in this implementation. Note that since the host RMA is kept busy during only a fraction of that time (64 clock cycles), a maximum theoretical bandwidth of 250 MB/s exists when several requests are pipelined.
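These two bandwidth figures can be cross-checked from the cycle counts, assuming a 32-byte cache line (eight words per line, as configured in Section 5.3, with 4-byte words, an assumption here) and a 500-MHz clock.

```python
# Cross-check of the RMA bandwidth figures.
LINE_BYTES = 8 * 4   # eight words per line, 4 bytes each (assumed)
FREQ_HZ = 500e6      # 500-MHz clock

# Serial case: one cache line every 182 cycles end to end.
end_to_end_mb_s = LINE_BYTES * FREQ_HZ / 182 / 1e6   # ~88 MB/s

# Pipelined case: the host RMA is busy only 64 cycles per line.
peak_mb_s = LINE_BYTES * FREQ_HZ / 64 / 1e6          # 250 MB/s

print(round(end_to_end_mb_s), round(peak_mb_s))
```

The serial figure comes out near the quoted 90 MB/s, and the pipelined figure matches the 250 MB/s theoretical maximum exactly.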

    5.3 Experiments and Results

    The platform is configured as follows:

. 3x3 processor array, NoC with four-position input buffers and 32-bit channel width.

. processor cache size was set to 4, 8, and 16 kB, with eight words per line.

    . 500-MHz frequency for both processor nodes andNoC routers.

    . CPU: hardware multiplier, divider, and barrelshifter.

Fig. 7 shows the thread mapping used for the experiments. In those, shared data reside in the top-left processor node and are accessed by threads running in the other nodes. This


Fig. 5. DSM node organization.

Fig. 6. Cache miss RMA protocol.

    Fig. 7. Adopted scenario.

mapping results in less available bandwidth, as only two NoC ports are available. This scenario was used on purpose so as to enable identifying possible limitations/bottlenecks of the proposed system. For the sake of simplicity, Fig. 7c does not illustrate all RMAs. All results were gathered on a synthesizable RTL VHDL description of this accelerator. Note that all features inherent to the prototype (e.g., runtime vSMP cluster definition) were also evaluated on FPGA (i.e., with simplistic scenarios due to the memory limitations).

    5.3.1 Reference Platforms

The GEM5 simulator framework [34] was used to produce two shared-memory reference platforms: bus- and NoC-based platforms. GEM5 was chosen because it provides a good tradeoff between simulation speed and accuracy, while modeling a real RealView Platform Baseboard Explore for Cortex-A9 (ARMv7 A-profile ISA).

To create our first reference platform (Fig. 8a), it was necessary to modify the CPU model and Linux bootloader supported in GEM5 to enable configurations with more than four cores. The bus-based platform, illustrated in Fig. 8a, is configured as follows:

1. up to eight ARM Cortex-A9 cores,
2. CPUs running at 500 MHz,
3. Linux kernel 2.6.38,
4. 16-kB private L1 data and instruction caches,
5. 32-bit channel width, and
6. DDR physical memory running at 400 MHz.

Fig. 8b shows the main architecture of our second reference platform, which comprises

1. up to eight ARM Cortex-A9 cores,
2. CPUs running at 500 MHz,
3. Linux kernel 2.6.38,
4. 16-kB private L1 data and instruction caches,
5. 32-bit channel width,
6. 256-MB unified L2 cache, and
7. an interconnection network.

Such a large L2 cache size was decided upon to avoid any traffic with external DRAM memory during benchmark execution, so as to perform fair comparisons with the proposed architecture template, which uses on-chip memories only. The cache coherence protocol that was used is MOESI Hammer (AMD's Hammer protocol used in AMD's Hammer chip [35]). Three interconnection network topologies were used as reference NoC-based models: 1) crossbar, where each controller (L1/L2/Directory) is connected to every other controller via one router (modeling the crossbar), 2) mesh, and 3) torus. In both the mesh and torus topologies, each router is connected to an L1 and an L2.

    5.3.2 Speedup

The first experiment evaluates the platform scalability considering MJPEG, Smith Waterman, FFT, and LU during their execution. To maximize speedup, one thread per node was used in all experiments, thereby avoiding performance penalties resulting from context switching. The following results are CPU-normalized so as to assess the scalability of the template and abstract away ISA/microarchitecture specifics: hence, single-core performance is shown as identical for all architectures. Fig. 9a shows how MJPEG performance scales with the number of cores. Near-linear speedup is observed for the proposed architecture template when cache sizes of 8 and 16 kB are employed. Reference NoC-based platforms show better performance when compared to the reference shared bus-based system, which reaches a plateau from five cores due to bus saturation. The same behavior is also observed with the proposed architecture template when using a cache size of 4 kB.

In turn, Fig. 9b shows near-perfect linear speedup for the Smith-Waterman application, for both the proposed and reference platforms; a maximum speedup of 6.3 is achieved on the reference shared-memory mesh platform. As the application is very compute oriented, few data communications occur and the thread code fits in the local caches for all tested configurations, resulting in a limited cache miss rate. Different from the previous benchmarks, the FFT application was employed to explore shared data protection when multiple threads trigger concurrent accesses to the same shared data/variable (e.g., during a read or write operation). In this situation, synchronization primitives such as mutexes must be used to enforce exclusive thread data access, and barriers to coordinate the parallel execution of threads during their synchronization. Thus, a mutex variable is used to allow each thread to use a single variable in shared memory to identify itself before starting its execution.

Fig. 9c shows the performance scalability of the FFT application in the adopted scenario (Fig. 7). Due to the sequential synchronizations, the parallelization capability of FFT is low. Thus, using our platform with a 16-kB cache configuration, the application can only achieve a speedup of 4.6. These results nonetheless show better performance when compared to the other reference systems, which reach a plateau from five cores. The LU scenario presented the worst scalability results, as shown in Fig. 9d. Excluding the proposed distributed memory architecture with a cache size of 16 kB, which does not reach saturation (application speedup of 4.1), the other architectures reached a plateau from seven cores.

To further explore the platform, we generated a 4x4 scenario, varying the host position. Three scenarios were explored:

. Host placed at (0,0): which comprises, in Cartesian addressing, the corner positions (00, 03, 30, and 33). In this case, in the worst case, three threads communicate using the east port and the other 12 threads communicate through the south port, making it the bottleneck.


    Fig. 8. Reference platforms.

. Host placed at (1,1): comprising the central mesh positions (11, 12, 21, and 22), which reduces the south port communication from 12 to 8 threads, while decreasing the NPU-host distance to four hops.

. Host placed at (1,0): which comprises the remaining positions. Due to the adopted XY routing algorithm, 12 threads communicate through the south port, while one and two threads communicate through the west and east ports, respectively.
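The port occupancy figures above can be reproduced by counting last-hop directions under XY routing. This is an illustrative count, assuming XY routing (X dimension first, then Y), coordinates (x, y) with y growing southward, and the host in the top-left corner; the function name is hypothetical.

```python
from collections import Counter

def host_input_port(src, host):
    """Last-hop direction into the host under XY routing (X first):
    if the source row differs, the final hop is vertical; otherwise
    the packet arrives along the host's row."""
    (sx, sy), (hx, hy) = src, host
    if sy != hy:
        return "south" if sy > hy else "north"
    return "east" if sx > hx else "west"

# 4x4 mesh, host in the top-left corner (0,0), 15 worker nodes.
host = (0, 0)
mesh = [(x, y) for x in range(4) for y in range(4)]
ports = Counter(host_input_port(n, host) for n in mesh if n != host)
print(ports["south"], ports["east"])  # 12 3
```

The count matches the corner-host scenario: three threads enter via the east port and the remaining 12 funnel through the south port, which therefore becomes the bottleneck.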

As displayed in Fig. 9e, the total gain using 15 threads is only about 3 percent when changing the host position alone; this can be explained by the fact that the MJPEG application is scalable and the NoC was not saturated when a 16-kB cache size is employed, as will be detailed in Section 5.3.3. The next experiment (Fig. 9f) explores the use of multiple vSMP clusters. This scenario comprises a 6x4 platform organized in three virtual clusters, where two 2x2 clusters are used to execute FFT and MJPEG while a 4x4 cluster executes the LU application. Jitter in thread execution time remains below 5 percent.

    5.3.3 RMA and NoC Throughput

Fig. 10 shows the average bandwidth usage during MJPEG execution for the two used NoC links at the host level (south and east), as well as for the RMA, which is the aggregated bandwidth of those. Thread mapping plays a role in the NoC usage, as explained in Section 2.2. It can be observed that the south link is unused for one and two threads because of the decided mapping; most data flow through the east host NoC link. For 8- and 16-kB cache sizes, the bandwidth grows almost linearly with the number of threads. Using a 4-kB cache size results in a significant bump in bandwidth usage, mostly because of the much increased instruction cache miss rate. A plateau is observed from four threads at around 200 MB/s, which is about 80 percent of the maximum theoretical bandwidth of the RMA module (250 MB/s). This explains the speedup plateau observed in Fig. 9a, caused by RMA saturation.

    6 COMPARISON OF PROPOSED TECHNIQUES

    6.1 Performance Comparison

Multithreading as presented in Section 5 allows for thread-level parallelism, hence at a finer grain compared to task-level parallelism. Beyond the obvious gains in programmability, this permits lifting some performance bottlenecks should a given task be critical in an application processing pipeline. To fairly assess the performance and drawbacks of


Fig. 9. Architecture scalability, for cache sizes of 4, 8, and 16 kB (a), (b), (c), (d); scalability scenario regarding host position in a 4x4 configuration (e); cluster exploration considering MJPEG (three threads), FFT (three threads), and LU (15 threads), executing simultaneously (f), where each bar represents the execution time of one worker thread.

    Fig. 10. Average bandwidth usage for MJPEG.

all proposed techniques, a case-study 3x3 processor system capable of running any of the proposed techniques has been built. This setup was experimented on an MJPEG processing pipeline comprising four tasks, with an unfavorable initial mapping in which IVLC and IQuant, the two most compute-intensive tasks, are executed on a single processor node in a time-sliced mode. Six simulation runs, depicted in Fig. 11, are presented, accounting for TM, TMR, RE, and multithreaded execution with 1 to 3 worker threads, alongside a main thread which handles synchronizations and acts as a wrapper.

Fig. 12a shows the throughput obtained for each technique, triggered at the same time in all six runs. The throughput in all scenarios is maintained up until the migration trigger point. TM incurs a significant penalty due to the time required to transfer the task code, interrupting and later restoring data transfers. During this 10-ms period, TM throughput drops to about half of its initial value; TMR shows a greater performance penalty that lasts much less, completing migration in about 4 ms thanks to the data redirection strategy. RE negates the observed performance penalty and instantly triggers execution on the remote node. Post-migration performance is, however, lower than TM due to the latency incurred by remote instruction fetching. Concerning multithreaded implementations, single-thread performance is similar to RE, while the two- and three-thread implementations outperform all other techniques, at the expense of, respectively, one and two more processor nodes used. Fig. 12b plots the buffer usage when the MJPEG pipeline is connected to a video DAC that consumes packets at regular time intervals. Buffer occupation drops during migration as a function of the used technique. While TM requires over 20 buffered packets to avoid starvation, much less is required for all other techniques. Buffer occupation then increases after migration as higher throughput is achieved; the plot crossings give an assessment of the break-even point and, therefore, the payback time.

Multithreading is here regarded as a technique complementary to TM, TMR, and RE: in the example above, should a higher level of performance be required for the video decoding, multithreading offers an alternative to repartitioning the entire application into tasks, which would otherwise be needed. As multithreaded execution performance is sensitive to the available bandwidth on the NoC, as presented in Fig. 10, several measures may be taken to limit the performance penalty, such as controlling the amount of through-traffic in the vSMP cluster.

    6.2 Proposed Architecture Area Overhead

Area results were obtained for both FPGA and ASIC targets, the latter in a 40-nm CMOS process, with the following memory sizes: 1) 64 kB, 2) 128 kB, and 3) 256 kB. The target processor architecture has an FPU; optional IO (e.g., debug, UART) was removed from synthesis.

Table 4 shows that the area overhead related to multithreading ranges from 5.72 to 15.59 percent, using, respectively, 256 and 64 kB of RAM. An additional analysis was performed using a Xilinx Spartan-3 FPGA; the RMA implementation incurs a resource overhead ranging from 7 to 15 percent depending on the size of the shared memory.

    7 CONCLUSION

While most platform SoCs are today heterogeneous, our approach follows a different route and advocates the design of adaptive and scalable multiprocessor cores, sitting next to a traditional AP, that could compete performance- and power-wise with ISPs thanks to their ability to adapt to time-changing scenarios. With the underlying motivation of architecture scalability, this paper demonstrated, on the basis of a scalable and synthesizable HW/SW multiprocessor template, several novel purely distributed adaptation techniques that leverage the existing techniques for such architectures:


    Fig. 11. Adopted scenarios.

    Fig. 12. Performance comparison for the four techniques.

TABLE 4: PE Area Evaluated at 40-nm CMOS Technology

1. Distributed PID-based control, experimented on frequency scaling, has proven effective both in the handling of transient perturbations and power-wise. This approach could be extended both in terms of algorithm (use of Kalman filters) and of control scheme, which could be designed hierarchically, such as those used in most complex control automation processes.

2. As an alternative to task migration in distributed memory systems, two techniques are presented: TMR and RE. The latter can be regarded as a technique that makes load balancing in distributed memory systems viable, because it incurs negligible latencies, similar to those in shared-memory multicore systems.

3. To widen the scope of load-balancing opportunities, a low-area-overhead technique for multithreading support based on DSM is proposed: It efficiently draws strength from the scalable, distributed-memory, and on-chip features of our architecture: The low memory-access latencies make it possible to dynamically spawn a large number of threads for achieving maximum performance.

Summarizing the above discussions, we believe that all these techniques combined make for a flexible template with a panel of techniques that permits exploiting different application profiles and use cases. Our current work lies in exploring the opportunities offered by this technique for harnessing the processing power of such dense multiprocessor cores from the AP for quasi-general-purpose application acceleration, contrary to typical heterogeneous platform SoCs. APIs such as OpenMP and OpenCL are promising alternatives that will certainly render such solutions more attractive in the future.

    ACKNOWLEDGMENTS

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under the Mont-Blanc Project: http://www.montblanc-project.eu, grant agreement no. 288777.

REFERENCES

[1] S. Borkar, "Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation," IEEE Micro, vol. 25, no. 6, pp. 10-16, Nov./Dec. 2005.

[2] R. Busseuil et al., "Open-Scale: A Scalable, Open-Source NoC-Based MPSoC for Design Space Exploration," Proc. Int'l Conf. Reconfigurable Computing and FPGAs, pp. 357-362, 2011.

[3] LIRMM, "Computing, Adaptation & Related Hardware/Software," http://www.lirmm.fr/ADAC, 2012.

[4] G. Magklis et al., "Profile-Based Dynamic Voltage and Frequency Scaling for a Multiple Clock Domain Microprocessor," SIGARCH Computer Architecture News, vol. 31, pp. 14-27, May 2003.

[5] F. Xie et al., "Compile-Time Dynamic Voltage Scaling Settings: Opportunities and Limits," SIGPLAN, vol. 38, pp. 49-62, May 2003.

[6] D. Puschini et al., "A Game-Theoretic Approach for Run-Time Distributed Optimization on MP-SoC," Int'l J. Reconfigurable Computing, article 403086, 2008.

[7] Q. Wu et al., "Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors," SIGARCH Computer Architecture News, vol. 32, pp. 248-259, Oct. 2004.

[8] Y. Zhu and F. Mueller, "Feedback EDF Scheduling Exploiting Hardware-Assisted Asynchronous Dynamic Voltage Scaling," SIGPLAN Notices, vol. 40, pp. 203-212, June 2005.

[9] U.Y. Ogras et al., "Variation-Adaptive Feedback Control for Networks-on-Chip with Multiple Clock Domains," Proc. 45th Ann. Design Automation Conf., pp. 614-619, 2008.

[10] A. Sharifi et al., "Feedback Control for Providing QoS in NoC-Based Multicores," Proc. Conf. Design, Automation and Test in Europe (DATE), pp. 1384-1389, Mar. 2010.

[11] M. Ghasemazar et al., "Minimizing the Power Consumption of a Chip Multiprocessor under an Average Throughput Constraint," Proc. Int'l Symp. Quality Electronic Design (ISQED), 2010.

[12] F. Clermidy et al., "An Open and Reconfigurable Platform for 4G Telecommunication: Concepts and Application," Proc. 12th Euromicro Conf. Digital System Design, Architectures, Methods and Tools (DSD '09), pp. 449-456, 2009.

[13] D. Puschini et al., "Dynamic and Distributed Frequency Assignment for Energy and Latency Constrained MP-SoC," Proc. Design, Automation and Test in Europe Conf. and Exhibition (DATE '09), pp. 1564-1567, 2009.

[14] M.A. Al Faruque et al., "ADAM: Run-Time Agent-Based Distributed Application Mapping for On-Chip Communication," Proc. 45th Ann. Design Automation Conf. (DAC '08), pp. 760-765, 2008.

[15] L.T. Smit et al., "Run-Time Mapping of Applications to a Heterogeneous SoC," Proc. Int'l Symp. System-on-Chip (SoC '05), pp. 78-81, 2005.

[16] E.L. de Souza Carvalho et al., "Dynamic Task Mapping for MPSoCs," IEEE Design & Test of Computers, vol. 27, no. 5, pp. 26-35, Sept./Oct. 2010.

[17] M. Mandelli et al., "Energy-Aware Dynamic Task Mapping for NoC-Based MPSoCs," Proc. IEEE Int'l Symp. Circuits and Systems, pp. 1676-1679, 2011.

[18] A.K. Singh et al., "Communication-Aware Heuristics for Run-Time Task Mapping on NoC-Based MPSoC Platforms," J. Systems Architecture, vol. 56, pp. 242-255, July 2010.

[19] S. Wildermann et al., "Run Time Mapping of Adaptive Applications onto Homogeneous NoC-Based Reconfigurable Architectures," Proc. Int'l Conf. Field-Programmable Technology (FPT '09), pp. 514-517, Dec. 2009.

[20] T. Streichert et al., "Dynamic Task Binding for Hardware/Software Reconfigurable Networks," Proc. 19th Ann. Symp. Integrated Circuits and Systems Design (SBCCI '06), pp. 38-43, 2006.

[21] O. Derin et al., "Online Task Remapping Strategies for Fault-Tolerant Network-on-Chip Multiprocessors," Proc. ACM/IEEE Fifth Int'l Symp. Networks-on-Chip (NOCS '11), pp. 129-136, 2011.

[22] H. Shen and F. Petrot, "Novel Task Migration Framework on Configurable Heterogeneous MPSoC Platforms," Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), pp. 733-738, 2009.

[23] F. Mulas et al., "Thermal Balancing Policy for Multiprocessor Stream Computing Platforms," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 12, pp. 1870-1882, Dec. 2009.

[24] S. Bertozzi et al., "Supporting Task Migration in Multi-Processor Systems-on-Chip: A Feasibility Study," Proc. Design, Automation and Test in Europe (DATE '06), vol. 1, pp. 1-6, 2006.

[25] M. Pittau et al., "Impact of Task Migration on Streaming Multimedia for Embedded Multiprocessors: A Quantitative Evaluation," Proc. IEEE/ACM/IFIP Workshop Embedded Systems for Real-Time Multimedia (ESTIMedia), pp. 59-64, 2007.

[26] D. Barcelos et al., "A Hybrid Memory Organization to Enhance Task Migration and Dynamic Task Allocation in NoC-Based MPSoCs," Proc. 20th Ann. Conf. Integrated Circuits and Systems Design (SBCCI '07), pp. 282-287, 2007.

[27] L. Barthe et al., "The SecretBlaze: A Configurable and Cost-Effective Open-Source Soft-Core Processor," Proc. IEEE Int'l Symp. Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), pp. 310-313, May 2011.

[28] F. Moraes et al., "Hermes: An Infrastructure for Low Area Overhead Packet-Switching Networks on Chip," Integration, the VLSI J., vol. 38, no. 1, pp. 69-93, 2004.

[29] S. Rhoads, "Plasma - Most MIPS I(TM)," http://www.opencores.org/project,plasma.

[30] J.R. Levine, Linkers and Loaders. Morgan Kaufmann Publishers Inc., 1999.


[31] N. Saint-Jean, "Study and Design of Self-Adaptive Multiprocessor Systems for Embedded Systems," PhD thesis, Univ. of Montpellier 2, France, 2008.

[32] C. Muller et al., "Design Challenges for Prototypical and Emerging Memory Concepts Relying on Resistance Switching," Proc. Custom Integrated Circuits Conf. (CICC), pp. 1-7, Sept. 2011.

[33] T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, no. 1, pp. 195-197, Mar. 1981.

[34] N. Binkert et al., "The gem5 Simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, Aug. 2011.

[35] A. Ahmed, P. Conway, B. Hughes, and F. Weber, "AMD Opteron Shared Memory MP Systems," Proc. 14th HotChips Symp., Aug. 2002.

Luciano Ost received the PhD degree in computer science from PUCRS, Brazil, in 2010. He is an associate professor at the University of Montpellier II and a member of the microelectronics group at LIRMM, which is a cross-faculty research entity of the UM2 and the National Center for Scientific Research (CNRS). From 2004 to 2006, he was a research assistant at PUCRS. He is a member of the IEEE.

Rafael Garibotti received the BSc degree in computer engineering from PUCRS, Brazil, the MSc degree in microelectronics from EMSE, France, and the MBA degree in project management from FGV, Brazil, and is currently working toward the PhD degree at LIRMM, France. He worked for two and a half years as a digital ASIC designer at CEITEC S.A., besides having done an internship at NXP Semiconductors, France.

Gilles Sassatelli is a senior scientist at LIRMM. He is currently the leader of the Adaptive Computing Group, composed of 12 permanent staff researchers and over 20 PhD students. He has published more than 150 papers in a number of renowned international conferences and journals, and regularly serves as track or topic chair in the major conferences in the area of reconfigurable computing. He is a member of the IEEE.

Gabriel Marchesan Almeida received the BSc degree in computer science from URI, Brazil, in 2004, the MSc degree in computer science in 2007, and the PhD degree in microelectronics from the University of Montpellier II, France, in 2011. He is currently a senior research scientist at the Karlsruhe Institute of Technology (KIT), Germany.

Rémi Busseuil received the master's degree in microelectronics in 2009 from the ENS Cachan school, Paris, and the PhD degree in microelectronics in 2012 from LIRMM, the Laboratory of Informatics, Robotics and Microelectronics of Montpellier, France. He is currently in a teaching position in France. He is a member of the IEEE.

Anastasiia Butko received the BSc and MSc degrees in design of electronic digital equipment from NTUU "KPI," Ukraine, and the MSc degree in microelectronics from UM2, France, and is currently working toward the PhD degree at LIRMM, France. She works on adaptive multiprocessor architectures for embedded systems in the Adaptive Computing Group.

Michel Robert received the MSc and PhD degrees from the University of Montpellier in 1980 and 1987, respectively. From 1980 to 1984, he was with the semiconductor division of Texas Instruments as an R&D engineer. From 1985 to 1990, he was a professor at the University of Montpellier 2 and the LIRMM laboratory. Since 2012, he has been the president of the University of Sciences of Montpellier, France.

Jürgen Becker received the Diploma degree in 1992 and the PhD degree in 1997, both from Kaiserslautern University, Germany. His research work focused on application development environments for reconfigurable accelerators, including hw/sw codesign, parallelizing compilers, and high-level synthesis. He is the author of more than 250 scientific papers published in peer-reviewed international journals and conferences. He is a senior member of the IEEE.

