
ALMARVI "Algorithms, Design Methods, and Many-Core Execution Platform for Low-Power Massive Data-Rate Video and Image Processing"

Project co-funded by the ARTEMIS Joint Undertaking under the ASP 5: Computing Platforms for Embedded Systems
ARTEMIS JU Grant Agreement no. 621439

Scalability, quality and usability of the execution platform (D3.5)

Due date of deliverable: March 31, 2016
Start date of project: 1 April 2014        Duration: 36 months
Organisation name of lead contractor for this deliverable: TU Delft
Author(s): Zaid Al-Ars (TUDelft)
Validated by: Joost Hoozemans (TUDelft)
Version number: 1.0
Submission Date: March 31, 2016
Doc reference: ALMARVI_D3.5_final_v10
Work Pack./Task: WP 3, Task 3.1
Description (max 5 lines): This report presents the quality aspects of the hardware configurations w.r.t. performance improvements, power/energy efficiency, scalability, and usability of the execution platform.
Nature: R = Report
Dissemination Level: CO (Confidential, only for members of the consortium, including the JU)


DOCUMENT HISTORY

Release  Date        Reason of change                                   Status  Distribution
V0.1     14/12/2015  Initial document organization                      draft   CO
V0.2     11/2/2016   Contributions from TUD-UTIA-NOK-TUT-UTURKU added   draft   CO
V0.3     13/3/2016   Merged contribution of partners                    draft   CO
V0.4     28/3/2016   Introduction and conclusions updated               draft   CO
V1.0     31/3/2016   Submitted to Artemis                               final   CO


Contents

1. Introduction
2. Nokia/TUT/UTURKU platform
   2.1. Performance improvements
   2.2. Power/energy efficiency
   2.3. Scalability
   2.4. Usability
3. TUDelft platform
   3.1. Performance improvements
   3.2. Power/energy efficiency
   3.3. Scalability
   3.4. Usability
4. UTIA platform
   4.1. Performance improvements
   4.2. Power/energy efficiency
   4.3. Scalability
   4.4. Usability
5. Conclusions
6. References


1. Introduction  

This report represents deliverable D3.5, which is part of Task 3.1 in WP3 of the ALMARVI project. D3.5 presents the quality aspects of the hardware configurations w.r.t. performance improvements, power/energy efficiency, scalability, and usability of the execution platform. This deliverable builds on D3.1, where the hardware platform solutions for image/video processing were initially described.

D3.5 assesses the quality of the appropriate architecture configuration, defined by the number and types of cores and the specialized image-processing instructions that may be reconfigured on FPGAs. Reconfiguration, adaptability, and scalability of the hardware configuration are important issues given the cross-domain nature, the variability in acceleration fabrics, and the image/video workloads. Therefore, special attention is paid to methods for adaptability, enabling the exchange of processing between the different elements in the configuration, in order to optimally use the properties of the hardware at hand. This deliverable also describes the integration of heterogeneous acceleration fabrics, interconnects, and protocols for the platforms, such that the interfaces between the different kinds of hardware can be used in the most effective way. Configuration choices are application- and domain-specific, taking into account quality issues such as energy, quality of service, and throughput, while allowing for massive real-time low-power data processing.

The partners involved in delivering ALMARVI execution platforms are NOK, TUT, UTURKU, TUDelft and UTIA. Section 2 describes the execution platform developed by NOK, TUT and UTURKU. Sections 3 and 4 describe the platforms of TUDelft and UTIA, respectively.

[Figure: Position of D3.5 in the context of the ALMARVI project within WP3, indicating the contribution of D3.5.]


2. Nokia/TUT/UTURKU  platform  

As agreed in the plan for the ALMARVI common system software stack, we have adopted OpenCL as the parallel programming language for the heterogeneous architecture. OpenCL allows the programmer to describe program parallelism by expressing the computation in the Single Program Multiple Data (SPMD) style. In this style, multiple parallel work-items execute the same kernel function in parallel, with synchronization expressed explicitly by the programmer. Another key concept in OpenCL is the work-group, which collects a set of coupled work-items that may synchronize with each other. Across multiple work-groups executing the same kernel, however, there can be no data dependencies. These concepts allow exploiting parallelism at multiple levels for a single kernel description: inside a work-item, across work-items in a single work-group, and across all the work-groups in the work-space.

At the highest level, OpenCL command queues are a means to describe task-level parallelism and to map the execution of a larger multi-kernel application onto a heterogeneous system with various device types. This level will be evaluated in the Zynq demonstrator, which supports both tailored devices implemented in the FPGA fabric and an ARM CPU device. In this setup, the whole application can be functionally verified by writing an OpenCL host application that is executed on the ARM host or in any OpenCL-supported desktop environment. For the OpenCL implementation, the pocl project is used as a basis. This open-source OpenCL implementation is ported to the Zynq platform in such a way that the whole demonstration setup can be controlled by means of a single pocl OpenCL context.

Customized processors provide a middle ground between fixed-function accelerators and generic programmable cores. They bring the benefits of hardware tailoring to programmable designs, while adding new advantages such as reduced implementation verification effort. The hardware of a customized processor is optimized for executing a predefined set of applications, while allowing the very same design to run other, sufficiently similar routines by switching the executed software in the instruction memory. The degree of processor hardware tailoring is dictated by the use case and the targeted product.

In any case, the processor customization process is highly demanding and error-prone, with high non-recurring engineering costs. Moreover, as the design process of customized processors is usually iterative in nature, porting the required software to new processor variations needs either assembly language rewrites or retargeting the compiler. One approach to simplifying the processor customization process is to compose the processor from a set of component libraries and other verified building blocks, thereby reducing the required verification effort.

The software porting problem can be alleviated with automatically retargeted software development kits. For this purpose we use the TTA-Based Co-Design Environment (TCE), a processor design and programming toolset based on a processor template that supports different styles of parallelism efficiently. TCE enables rapid design of cores ranging from tiny scalar microcontrollers to multicore vector machines, with a resource-oriented design methodology that emphasizes reuse of components.

We have selected a few key use cases to challenge the TCE and pocl toolsets to produce application-specific designs that meet the requirements; these are listed after the kernel sketch below.
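The following minimal OpenCL C kernel illustrates the work-item and work-group concepts introduced at the start of this section (a generic sketch for illustration, not ALMARVI project code): each work-item processes one element, work-items within a work-group may synchronize through a barrier, and no data dependencies may exist across work-groups.

    /* Every work-item runs the same kernel body (SPMD). */
    __kernel void scale_rows(__global const float *in,
                             __global float *out,
                             __local float *stage)     /* work-group local */
    {
        size_t gid = get_global_id(0);   /* unique across the work-space  */
        size_t lid = get_local_id(0);    /* index within this work-group  */

        stage[lid] = in[gid];            /* stage data in local memory    */
        barrier(CLK_LOCAL_MEM_FENCE);    /* synchronize the work-group    */

        /* Work-groups are independent: no data dependencies across them. */
        out[gid] = 2.0f * stage[lid];
    }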

1) 4G LTE is a standard for high-speed, low-latency wide-area cellular data communications, and builds upon the technologies developed by the 3GPP project. The most demanding and compute-intensive algorithms in a modern radio receiver relate to signal detection and demodulation. The MIMO technique employs multiple transmitter and receiver antennas to transmit multiple parallel data streams. As the system employs M x N different paths for the signal, higher diversity, more reliable communication, and higher throughput are achieved. The price to be paid is higher receiver complexity, which increases exponentially with the MIMO and modulation order. LTE provides several device categories, up to a 600 Mbit/s downlink data rate with 4 MIMO layers. The design must be scalable so that different device categories can be supported with simple parameterization of the architectural template, rapidly producing processor designs that trade off area and performance. In addition, at run-time, a more robust demodulation algorithm can be selected at the expense of demodulation speed, depending on the channel conditions. We have set the power budget for the designed processor to less than 1 W, in order for the design to be suitable for mobile handset use scenarios.

2) In the second co-processor design we perform audio signal processing in a wearable, always-on type of device. The processor implements audio signal processing algorithms such as IIR biquads, linear filters, spectral analysis, and adaptive filters. The input sample rate of such systems is quite modest compared to wideband radio transceivers, but the use case requires extremely low energy and power consumption due to the limited power supply and the small form factor of the planned end product. The design is optimized for two


use cases: active noise cancellation and a "hearing-aid" type of functionality for hearing enhancement. For battery-based operation, the power consumption of the designed processor should be lower than 1 mW.

3) The third case is a custom processor targeted at running various machine learning algorithms, optimized for floating-point calculations. It also uses the Transport Triggered Architecture (TTA) as the processor template and was designed using the TCE (TTA-based Co-Design Environment) tool chain. The functional units (FUs) of the processor are tailored towards implementing a wide variety of machine learning elementary operations, making it suitable for the learning and classification parts of Tasks 2.3 and 2.4.

 

2.1. Performance  improvements  

LTE Receiver

Compute capabilities can of course be designed for the best possible performance under all conditions. That, however, can easily lead to receiver overdesign and excessive energy usage in the computation units. It is quite straightforward to select the best algorithm that satisfies the minimal service requirements, for instance depending on channel conditions, modulation order, and user allocation. For a lower user allocation and a low modulation order, more complex algorithms can always be used to guarantee the lowest possible bit error rate. This is essential at least at the cell service boundary: in this situation the data rate and modulation order do not limit the use of compute capability for a more effective algorithm to improve the bit error rate. Closer to the serving base station, where the signal-to-noise ratio and the offered rate are high, simpler algorithms can be used, because the signal-to-noise ratio is not limiting the throughput but the available compute capability limits the use of more complex algorithms.

In wireless multiple-input multiple-output (MIMO) transmission over fading channels, maximum likelihood (ML) detection is desired to achieve low bit error rates. ML detection, however, involves an exhaustive search over all possible digitally modulated symbols, with complexity that grows exponentially with the rate. Therefore, in practical implementations ML detection is either approximated by an algorithm that limits the symbol search space, or completely replaced with linear detection, thus trading equalizer performance for complexity. For this study we have selected the two currently most attractive MIMO detection algorithms: the Minimum Mean Square Error (MMSE) equalizer and the Layered Orthogonal Lattice Detector (LORD), the first being representative of linear equalizers and the second a suboptimal ML equalizer with deterministic complexity (latency) and a soft-output generation complexity that is linear in the number of transmission antennas. Both algorithms are very practical to implement, as they can be parallelized in many dimensions to utilize instruction-, vector-, and thread-parallel hardware, leveraging either parallelism in the algorithm itself or the parallel, independent subcarriers of the OFDM transmission.

The processor core designed within the ALMARVI project, called LordCore, is based on the Transport Triggered Architecture paradigm and contains a 512-bit wide SIMD datapath for high-performance computation. The SIMD datapath can perform calculations on 32 16-bit wide half-precision floating-point values in parallel. The core also contains a 32-bit datapath for address and control calculations. The design is multicore-ready, including simplified synchronization hardware, and a test chip with a dual-core configuration is being fabricated. Figure 1 illustrates the architecture of a single core.
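For reference, the two detector families can be summarized in standard textbook notation (this formulation is the usual one from the MIMO literature, added here for clarity rather than quoted from the project documents). For a received vector y = Hs + n, with channel matrix H, symbol vector s drawn from a constellation of size Q on each of M layers, and noise variance sigma^2:

    \hat{s}_{\mathrm{ML}}   = \arg\min_{s \in \mathcal{Q}^{M}} \lVert y - Hs \rVert^{2}
                              \quad \text{(exhaustive search over } Q^{M} \text{ candidates)}

    \hat{s}_{\mathrm{MMSE}} = \left( H^{H} H + \sigma^{2} I \right)^{-1} H^{H} y

The Q^M term is the exponential growth referred to above; LORD avoids it by constraining the search layer by layer, which is why its soft-output complexity is linear in the number of transmit antennas.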
The architecture is scalable to other SIMD widths as well: 8- and 16-lane versions have also been developed for lower-performance usage such as LTE-M, and a 64-lane version is being considered.

Each core is connected to three memories: an instruction cache, a local scratchpad memory, and a global data memory. The instruction cache feeds instructions to the core. The cache is 128 bits wide, as the instructions are 128 bits wide, and has space for 1024 instructions, so the total size of the instruction cache per core is 16 KiB. The cache is direct-mapped with a cache line size of 32 instructions. There is no hardware-based coherence in the instruction caches, as they are read-only from the point of view of the core; an external invalidate signal suffices for reprogramming.

The local scratchpad memory is mapped to the OpenCL private and local memory spaces. This memory is 512 bits wide with a single data port, allowing either one 512-bit read or one 512-bit write per clock cycle. The memory has a capacity of 32 KiB, which was enough to store all the temporary data needed by the LORD and MMSE algorithms.

The global data memory is on-chip but outside the core, shared by all the cores of the chip and connected to them through a shared AXI bus. Global memory access is considerably slower than the local scratchpad memory. Figure 2 shows the system architecture of a chip containing two LordCores.
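The quoted cache geometry is internally consistent, as the following small C program checks (a sketch; all sizes are taken directly from the text above):

    #include <assert.h>

    enum {
        INSN_BYTES   = 128 / 8,                     /* 128-bit instructions     */
        ICACHE_INSNS = 1024,                        /* capacity in instructions */
        ICACHE_BYTES = ICACHE_INSNS * INSN_BYTES,   /* total cache size         */
        LINE_BYTES   = 32 * INSN_BYTES,             /* 32 instructions per line */
        ICACHE_LINES = ICACHE_BYTES / LINE_BYTES    /* direct-mapped lines      */
    };

    int main(void)
    {
        assert(ICACHE_BYTES == 16 * 1024);  /* 16 KiB per core, as stated */
        assert(LINE_BYTES   == 512);        /* 512-byte cache lines       */
        assert(ICACHE_LINES == 32);         /* 32 direct-mapped lines     */
        return 0;
    }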


     

 Figure 1 Architecture of a single core

 Figure 2 Multicore architecture

 

Table 1 Throughput (Mbit/s) of the LORD and MMSE algorithms running on the processor in different modes and with different core counts.

Algorithm  Layers  N_rx  Modulation  Single-core  Dual-core  Quad-core
LORD       2       2     QPSK         122.3        226.0      445.3
LORD       2       2     16-QAM       125.1        241.0      471.5
LORD       2       2     64-QAM        74.4        145.7      286.3
LORD       2       4     QPSK          84.9        161.1      314.3
LORD       2       4     16-QAM       103.4        199.6      391.0
LORD       2       4     64-QAM        69.3        135.9      267.1
MMSE       2       2     QPSK         217.9        380.1      703.1
MMSE       2       2     16-QAM       426.7        746.4     1382.0
MMSE       2       2     64-QAM       548.6        976.2     1849.7
MMSE       2       2     256-QAM      570.9       1042.9     1979.2
MMSE       4       4     QPSK          60.1        117.2      229.0
MMSE       4       4     16-QAM       119.3        232.4      454.1
MMSE       4       4     64-QAM       171.5        334.7      639.3
MMSE       4       4     256-QAM      209.5        409.3      784.3


Table 2 Comparison to other software MIMO detectors

System                   Algorithm  Layers  N_rx  Modulation  Throughput (Mbps)
Quadro FX 1700 [1]       LORD       2       2     16-QAM        16.8
Proposed (single-core)   LORD       2       2     16-QAM       125.1
GeForce 560 Ti [2]       1-way?     2       2     16-QAM       834.1
GeForce 560 Ti [3]       2-way?     2       2     16-QAM       402.2
Proposed (single-core)   MMSE       2       2     16-QAM       426.7
GeForce 560 Ti [2]       1-way?     2       2     64-QAM       183.5
GeForce 560 Ti [2]       2-way?     2       2     64-QAM        92.4
Proposed (single-core)   MMSE       2       2     64-QAM       548.6
Proposed (single-core)   LORD       2       2     64-QAM        74.4
Proposed (dual-core)     LORD       2       2     64-QAM       145.7
Proposed (quad-core)     LORD       2       2     64-QAM       286.3
Tesla C1060 [3]          MTT        4       4     QPSK         284.7
Proposed (quad-core)     MMSE       4       4     QPSK         229.0
Tesla C1060 [3]          MTT        4       4     16-QAM       120.0
GeForce 560 Ti [2]       1-way?     4       4     16-QAM       782.5
GeForce 560 Ti [2]       2-way?     4       4     16-QAM       386.1
Proposed (single-core)   MMSE       4       4     16-QAM       119.3
Proposed (quad-core)     MMSE       4       4     16-QAM       454.1
Tesla C1060 [3]          MTT        4       4     64-QAM        12.0
GeForce 560 Ti [2]       1-way?     4       4     64-QAM       230.7
GeForce 560 Ti [2]       2-way?     4       4     64-QAM       115.9
Proposed (single-core)   MMSE       4       4     64-QAM       171.5
Proposed (quad-core)     MMSE       4       4     64-QAM       639.3

Audio Signal Processing

In the second co-processor design we target audio signal processing in a wearable, always-on type of device. The processor implements audio signal processing algorithms such as IIR biquads, linear filters, spectral analysis, and adaptive filters. The input sample rate of such systems is quite modest compared to wideband radio transceivers, but the use case requires extremely low energy consumption. Audio processing is an inherent part of multimedia signal processing and cannot be neglected.

The architecture of the audio demonstrator setup is depicted in Figure 3 and the actual processor design in Figure 4. The main use cases for our demonstrations have been active noise cancellation and headphone transparency, which can only be achieved with a low processing latency. In Figure 3, the in-ear and out-ear microphones are sampled synchronously at a 48 kHz sample rate. Initially, a delay of 1/48000 s (about 20.8 µs) was set as the upper bound for the signal processing latency. For the synthesized processor we have a power budget of 1 mW.

The demonstrator works in real time on the Zynq platform. The audio co-processor is synthesized onto the FPGA. The ARM A9 processors in the Zynq platform control the DSP through shared memory and the memory-mapped registers of the custom audio DSP. The audio input as well as the A/D and D/A converters are seen by the DSP as stream I/O ports, and the main loop of the software executes at the rate at which data becomes available on the stream I/O ports.

The design depicted in Figure 4 is a very simple 4-bus TTA processor, with support for 32-bit integer and single-precision scalar arithmetic, and containing two functional units for two-way single-precision vector arithmetic. Its I/O unit handles the streaming input and output. With this design we have achieved a signal processing latency of 1/(8 * 48000) s (about 2.6 µs), thus exceeding our initial target by a factor of 8.


Figure 3 Audio Signal Processing Setup

 Figure 4. Audio Processor Design

Machine Learning TTA

Ten data mining algorithms (C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART) were identified for use in work package 2. Additionally, three more algorithms commonly used in practical data mining tasks were considered: logistic regression (classification), linear regression, and the Fast Fourier Transform (FFT), which is commonly used in signal processing.

Figure 5 lists elementary operations on the left-hand side; each column corresponds to one of the 13 data mining algorithms. A bullet sign indicates the presence of an elementary operation in that algorithm. The bottom line indicates to which data mining task an algorithm belongs: CLA - classification, REG - regression, C&R - classification and regression, CLU - clustering, ASS - association rule discovery, DSP - digital signal processing. We can see from the table that vector operations take priority: the first four operations receive a weighted score that is higher than the rest by a large margin. These should have priority for optimization at the hardware level, with application-specific components and memory addressing.

Figure 5 Analysis of elementary computations.



 

The processor is implemented with 5 transport busses divided into two groups. The high-level organization of the core and its FUs is presented in Figure 6 (left). The first group contains all integer units, the load-store unit, and other miscellaneous units. The second group contains all of the floating-point units. The integer units have three of the five transport busses reserved for them, while the floating-point units use the remaining two busses. The transport busses are not fully connected: by optimizing the number of connections to the functional units, significant power savings can be gained. The interconnect network has thus been heavily optimized for running machine learning applications rather than general computing tasks.

Additionally, the processor includes a Timing-Error Replacement (TER) FU for increasing variation robustness (ALMARVI Objective 4). The methodology here is based on having the system operate at a voltage and frequency point at which the timing of critical paths fails intermittently. These timing failures are detected by special latches and handled. Whether the target is power or energy savings, the detection and handling overhead has to be lower than the power savings resulting from the lower Vdd. Here, the error handling mechanism of the Timing-Error system is replacement: when an error is detected, the erroneous value is replaced with a predetermined safe value. The safe value is algorithm-specific. For example, in the category of iterative algorithms working on probabilities, essentially maximizing (or minimizing) a probability metric by iterative means, the probability value from the previous algorithm iteration can be used as the safe value.
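The replacement policy can be sketched in C as follows (a conceptual illustration of the mechanism described above; the names and data layout are hypothetical, and in the real design the replacement is performed by the TER hardware, not by software):

    typedef struct {
        float value;        /* result produced by the function unit       */
        int   timing_error; /* flag raised by the error-detecting latches */
    } ter_result_t;

    /* Iterative probability update with timing-error replacement: a value
       flagged as erroneous is replaced by the algorithm-specific safe
       value, here the probability from the previous iteration. */
    static float ter_select(ter_result_t raw, float prev_iter_value)
    {
        return raw.timing_error ? prev_iter_value : raw.value;
    }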



   

Figure 6 (left) Simplified high-level architecture view of the implemented TTA processor, showing the function units and the group of transport busses each connects to. (Right) Chip photograph of the processor.

 

2.2. Power/energy  efficiency  

LTE Receiver

A two-core version of the processor was synthesized with Synopsys Design Compiler version I-2013.12, using a 28 nm Fully Depleted Silicon On Insulator (FDSOI) process technology. Switching Activity Interchange Files (SAIF) were produced from Modelsim simulations and used for power estimation in Design Compiler. Only the active processing time of each test case was included in the SAIFs. The designs were synthesized with leakage and dynamic power optimization and clock gating enabled. The synthesis reported an achieved clock rate of 950 MHz at the nominal operating voltages and conditions. The estimated power consumption of the two-core version is 137 mW (MMSE) and 163 mW (LORD), well below our targeted 1 W power budget. The extrapolated power consumption of a four-core version is about 270 mW, which would deliver the targeted LTE device category 11 performance of 600 Mbit/s.

Audio Signal Processing

The audio processor was synthesized with Synopsys Design Compiler version I-2013.12, using a 28 nm Fully Depleted Silicon On Insulator (FDSOI) process technology. Switching Activity Interchange Files (SAIF) were produced from Modelsim simulations and used for power estimation in Design Compiler. Only the active processing time of each test case was included in the SAIFs. The designs were synthesized with leakage and dynamic power optimization and clock gating enabled. The synthesis reported an achieved target clock rate of 49.152 MHz (i.e., 1024 clock cycles per 48 kHz sample), leveraging the sub-threshold design methodology of the University of Turku. The estimated power consumption of the two-core version is 300 µW, which is well below our targeted 1 mW upper bound.

Machine Learning TTA

Table 3 presents detailed implementation information for the processor. The on-chip memory is foundry IP that was not optimized for low-voltage operation and is therefore situated in a separate voltage domain from the processor. To lower the dynamic power usage of the processor core, clock gating was also used. The processor is capable of operating at an average power of 110 µW when executing machine learning algorithms, with a minimum of 5.3 pJ/cycle and 1.8 nJ/iteration for Incremental Bayes.

Table 3 Implementation details of the processor

Process               28nm FDSOI CMOS
Core area             0.30 mm²
Operating voltage     Core 0.35 V, memory 1.0 V
Operating frequency   20.6 MHz
Instruction memory    2048 instructions (2048 x 84 bits)
Data memory           8 kB (2048 x 32 bits)
FP performance        30.9 MFLOPS (1.5 MFLOPS/MHz)
Bus configuration     5 busses, divided between integer FUs (3 busses) and floating-point FUs (2 busses)
Integer FUs           ALU, multiplier
Floating-point FUs    ALU with multiplier, divider, compare unit, sqrt, sigmoid
Other FUs             LSU, SPI IO unit, conversion unit (float-to-int, int-to-float), TER
Registers             2x 16x32-bit register files, 2x 8x32-bit register files
 

2.3. Scalability    

LTE Receiver

The architecture of the processor was designed to be scalable. The memory architecture allows scaling to multiple cores, and the SIMD width of the processor can easily be scaled. A width of 32 lanes was selected as the default because a relatively wide SIMD allows more work to be done per instruction bit, minimizing the energy used for instruction fetch, while still keeping the individual cores small enough to be easy to synthesize. Smaller single-core 8- or 16-lane versions could be used for low-throughput systems such as LTE-M, while multi-core 64-lane versions could extend performance for future communication standards. The scalability of the architecture is demonstrated by the throughput simulations of the designed architecture shown in Table 1.

Audio Signal Processing

The architecture of the processor was designed to be scalable, albeit within an extremely tight power and latency budget. The SIMD width of the processor can easily be scaled, as can the number of functional units. Initially the processor was designed to process the left and right channels separately. However, it was soon discovered that the left and right channels can be conveniently processed with 2-way vector units, handling the left and right channels in the lower and upper parts of the vector with no increase in code size; thus a 2-wide SIMD operation set was added to the design.

Machine Learning TTA

As the OS control described in D1.3 can also be implemented with the TER system, the processor is voltage-scalable from 0.35 V to 1 V, with an approximate operating frequency of 750 MHz at 1 V (the exact frequency cannot be measured due to IO restrictions).

2.4. Usability

The processor architecture was tailored using the TTA-based Co-design Environment (TCE) tools and its re-targetable OpenCL compiler [4], [5], based on the Transport Triggered Architecture (TTA) paradigm [6]. In transport-triggered processors, the datapath buses are exposed to the programmer: the processor is programmed by scheduling the data transfers that take place. Actual operations (e.g., arithmetic or memory operations) are executed when a transport is made to the specific "trigger port" of the function unit implementing the operation.
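As a toy illustration of this programming model (a C simulation sketch, not TCE code; real TTA hardware moves values over datapath buses rather than through function calls), an addition "happens" as a side effect of transporting an operand to the adder's trigger port:

    #include <stdio.h>

    typedef struct {
        int in1;     /* plain operand port: writing it has no side effect */
        int result;  /* output port, readable by later transports         */
    } add_fu_t;

    static void move_to_in1(add_fu_t *fu, int v)     { fu->in1 = v; }

    /* Transporting to the trigger port starts the operation. */
    static void move_to_trigger(add_fu_t *fu, int v) { fu->result = fu->in1 + v; }

    int main(void)
    {
        add_fu_t add = {0, 0};
        move_to_in1(&add, 2);      /* transport 1: operand -> add.in1     */
        move_to_trigger(&add, 3);  /* transport 2: operand -> add.trigger */
        /* The result port could now be transported directly to another
           FU input (software bypassing), skipping the register file.     */
        printf("%d\n", add.result);
        return 0;
    }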


Because high-level programmability of the designed processors is a priority, a key tool in TCE is its re-targetable software compiler, tcecc. The compiler uses the LLVM project as a backbone and pocl to provide OpenCL support. The frontend supports the ISO C99 standard with a few exceptions, most of the C++98 language constructs, and a subset of OpenCL C.

Compared to traditional "operation-programmed" Very Long Instruction Word (VLIW) architectures, where the instruction set specifies operations and data transfers occur as part of operations, the TTA programming model has the benefit that register file bypasses are explicitly programmed ("software bypassing"), and the operands of an operation do not all have to be read in the same clock cycle. Similarly, computed results do not have to be written to the destination register file on the same cycle they are produced, and the result write to a register can be omitted entirely if the result is bypassed directly to another operation. This allows using smaller register files with fewer read and write ports.

Because in TTA processors the register files and function units are fully decoupled from the rest of the architecture, thanks to the customizable interconnection network and the data transport programming model, it is easy to design new processors in a "component-based" manner. During the project it was found that the TTA paradigm works extremely well for wide SIMD datapaths. This is because SIMD instructions save instruction bits per operation, and thus instruction fetch power, typically a major pitfall of TTA and VLIW type processors, while the interconnection networks and simplified register files enabled by the TTA approach save power on the datapath side, where most of the power of streamlined-control-unit SIMD/VLIW processors is typically spent.

LTE Receiver

The algorithms were implemented in the OpenCL language. OpenCL allows using vector data types to execute the same code on many SIMD lanes of the processor, and also makes it easy to parallelize the workload over multiple cores. Each subcarrier executes in its own vector lane, and the algorithm is executed for 32 subcarriers in parallel per core. The OpenCL standard only supports vector data types up to 16 elements wide, but pocl was extended to support vector data types up to 32 elements wide.

The group of 32 subcarriers that executes on one core at a time forms one single-work-item OpenCL work-group. Multiple of these work-groups execute concurrently on the multiple cores of the processor. Pocl contains a simple work queue scheduler in which each thread gets a new work-group to execute after completing the previous one. The more advanced application-level command queue runtime reported in D4.4 was not utilized in this case, as the focus was on single-kernel performance. Changing the SIMD width for different versions of the processor requires relatively small changes to the program: only the SIMD data types need to be changed and the shuffle intrinsic calls modified.
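The per-subcarrier vectorization can be sketched as follows (assumptions: "half32" stands in for the 32-wide vector type enabled by the pocl extension described above, since standard OpenCL C defines vector widths only up to 16; half types require the cl_khr_fp16 extension):

    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    /* One single-work-item work-group processes 32 subcarriers at once,
       one subcarrier per lane of the 512-bit SIMD datapath. */
    __kernel void scale_subcarriers(__global const half32 *rx,
                                    __global const half32 *gain,
                                    __global half32 *out)
    {
        size_t g = get_group_id(0);  /* one work-group per block of 32 subcarriers */
        out[g] = rx[g] * gain[g];    /* 32 half-precision multiplies in one
                                        vector operation                           */
    }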

Audio Processor

The algorithms are implemented in the C language. C was selected due to the size of the application and the possibility of lower-level control than OpenCL offers. However, the same support for vector data types that is available in OpenCL C was used via a compiler extension available in the Clang compiler that was employed. This was done to obtain code that can explicitly utilize the SIMD units without using intrinsics or other less portable means.
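A minimal sketch of this approach (assuming Clang's ext_vector_type extension; the filter form and coefficient names are generic textbook choices, not the project's actual code) processes the left and right channels in the two lanes of one vector, matching the 2-way vector units described in the scalability section:

    /* OpenCL-style 2-wide vector type in plain C via a Clang extension. */
    typedef float float2 __attribute__((ext_vector_type(2)));

    /* One IIR biquad step in transposed direct form II;
       lane 0 carries the left channel, lane 1 the right. */
    static inline float2 biquad(float2 x, float2 *z1, float2 *z2,
                                float b0, float b1, float b2,
                                float a1, float a2)
    {
        float2 y = b0 * x + *z1;      /* scalars splat across both lanes */
        *z1 = b1 * x - a1 * y + *z2;
        *z2 = b2 * x - a2 * y;
        return y;                     /* both channels in one pass       */
    }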

Machine Learning TTA

The processor is programmed in C.


3. TUDelft  platform  

This section presents the quality aspects of the hardware configurations w.r.t. performance improvements, power/energy efficiency, scalability, and usability of the TUDelft execution platform.

The TUDelft platform used in ALMARVI is built around the rVEX reconfigurable VLIW processor. It is a VHDL implementation of the VEX architecture, in which many of the architectural parameters are implemented as design-time generics and some as parameters that can be changed at run-time. As such, the processor is design-time configurable and run-time parametrizable. The architecture itself was developed targeting media applications (image, video), and related architectures (i.e., the Lx/st200 series of processors by STMicroelectronics) have found widespread use in set-top boxes and related media devices.

In contrast to the commercial st200 series, the rVEX is a proof of concept showing the possibilities of dynamic (run-time) reconfiguration. It falls under the "Liquid Architectures" research theme of the TU Delft Computer Engineering Laboratory. The final aim of this theme is to develop and evaluate a platform that constantly adapts to the needs of the workload. This means that it must be able to provide high performance for single threads and high throughput for multiple threads (balancing Instruction Level Parallelism and Thread Level Parallelism, ILP and TLP). The rVEX platform supports this through the dynamic core in combination with a dynamic cache system.

In total, the rVEX platform consists of the synthesizable VHDL designs of the core, the cache, and a number of peripherals. This system can be used either standalone or in conjunction with the GRLIB library to create a SoC with a DDR controller and various other peripherals. On the software side, there is an interface tool that can connect to the core and provides the user with extensive debugging capabilities and full control of the processor. There are a number of different compilers that can target the rVEX, a port of binutils and GDB, basic Linux support (uCLinux with a NOMMU kernel), runtime libraries (the uClibc C standard library and newlib), and an architectural simulator. These components are explained in more detail in the usability section. The next sections discuss a number of improvements we have developed in the architecture and how they have impacted the performance of the platform.

3.1. Performance  improvements  

Traditionally, code size has been a drawback of the VLIW design philosophy. As many of the techniques used to increase ILP (e.g., loop unrolling) increase the code size, this metric usually does not compare favorably for a VLIW in relation to RISC machines. The need for horizontal NOPs (no-operations that fill unused issue slots when there is not enough ILP) increases this difference even more. The result is that VLIW processors usually require larger caches and more memory bandwidth to perform well.

By implementing a new VLIW instruction encoding, the performance of the rVEX processor has been increased by up to a factor of three while maintaining compatibility with the processor's dynamic parametrizability, as published in [4]. This has been achieved by removing the NOPs from the binary, which dramatically increases the effectiveness of the instruction caches (as can be seen in Figure 7). Here, the speedup is depicted when comparing processors with the new instruction encoding (stopbit) to the old instruction encoding (baseline). The right figure shows the differences in miss rates that cause the improvement. There are results for both the static (design-time reconfigurable) and dynamic (run-time reconfigurable) versions of the processor. The Powerstone embedded benchmark set was used for the evaluations. The dots show results for each individual benchmark; the lines represent the average for the entire set. The average speedup is highest when using an instruction cache size of 4 KiB (a factor of 3). The speedup decreases for larger cache sizes, but this is due to the small size of the benchmarks (the code sections of many benchmarks easily fit in the 32 KiB cache regardless of the used encoding).
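The idea behind the encoding can be illustrated with a short C sketch (the bit position and syllable width here are hypothetical, chosen for illustration; the actual rVEX format is defined in [4]): instead of padding every bundle to the full issue width with NOPs, each syllable carries a stop bit, and a bundle simply ends at the first syllable whose stop bit is set.

    #include <stdint.h>
    #include <stddef.h>

    #define STOP_BIT (1u << 31)  /* hypothetical stop-bit position */

    /* Returns the number of 32-bit syllables in the bundle starting at
       index pc: fetch continues until a stop bit terminates the bundle
       or the maximum issue width is reached. No padding NOPs are stored. */
    static size_t bundle_length(const uint32_t *imem, size_t pc,
                                size_t max_issue)
    {
        size_t n = 0;
        do {
            n++;
        } while (n < max_issue && !(imem[pc + n - 1] & STOP_BIT));
        return n;
    }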


   

Figure 7: Left, Speedups of the static and dynamic versions of the rVEX processor using the new instruction encoding. Right, Cache miss rate comparison of the old and new instruction encodings

The most important aspect of this improvement is that it leverages components in the processor that were added to support dynamic reconfigurability. In other words, the cost of adding this instruction encoding is lower for our dynamic VLIW than for normal VLIWs. The result is that, while performance increases, the difference in area and power utilization between a static VLIW and the rVEX (which started out quite considerable) is reduced, as will be shown in the next section. These improvements are important for ALMARVI because they mean that the size of the instruction memory needed to achieve a certain level of performance can be reduced greatly, which amounts to a large difference when scaling up the number of cores in a platform.

3.2. Power/energy efficiency

As the memory subsystem consumes a substantial fraction of a typical system's energy, increasing the effectiveness of the caches also reduces energy utilization, as can be seen in Figure 8. The figure depicts the energy utilization of the rVEX core with caches and main memory running the same (Powerstone) benchmark set.

Figure 8: Energy utilization of the rVEX core, caches, and main memory for the Powerstone benchmark set

The most interesting result is that the difference in energy utilization between the static and dynamic versions has been decreased, as can be seen in Figure 9. This is because we were able to use a number of components that are needed for dynamic reconfigurability to support the sparse encoding scheme.

●●

●●●●

●●

●●

●●

●●●●

●●

●●●●

●● ●

●●●●

●●

●●●●

●●

●●●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●

●●●●

●●

●●

●●●●●●●●●●●●●●

●●●●

●●

●●●●

●●●●●●●●●●●●●●

●●●●

●●

●●●●1

2

4

8

16

1 2 4 8 16 32Cache Size (KiB)

Spee

dup

● ●Dynamic Core Static Core

Fig. 5. Speedup for stop-bit implementation for different instruction cachesizes. The lines represent the average speedup for a particular cache size.

with sparse instruction encoding than without. This is becauseat those cache sizes the entire application fits in the cache,and the reduction in cache misses is offset by the penaltyof having a longer branch delay. This could be remedied byinserting alignment NOPs to ensure that branch targets arealways aligned, at the cost of an increase in cache misses.

In the same figure we also observe that for very smallinstruction cache sizes, the speedup is not as significant asit is for intermediate sizes. This is caused by the fact that thereduction in cache misses for intermediate cache sizes is farlarger than for small cache sizes, as seen in Fig. 4.

Fig. 6 shows the normalized execution times for the dy-namic core. The baseline is the average of the worst executiontime of each application individually, executing on the dy-namic baseline design with a cache size of 1KiB. This figureshows, for instance, that for smaller cache sizes, the dynamicstop-bit implementation performs equivalent to the dynamicversion without stop-bit with a cache between 2 and 4 timeslarger.

D. Energy Results

Fig. 7 presents the total energy consumed by each ofthe benchmarks. The lines show the geometric mean of allapplications at each cache size. We can see that for smallcache sizes, due to the additional hardware required to supportreconfiguration, the dynamic core consumes more energy thanthe static core. However, at large cache sizes the differentdesigns are closer together in terms of energy consumption.

Fig. 8 depicts the energy consumption of the dynamiccore relative to that of the static core (values greater than1 mean that the static version consumes less energy than thedynamic one). We can see that the baseline dynamic designconsumes far more energy at small cache sizes, whereas when

Fig. 6. Normalized execution times for the dynamic core

Fig. 7. Energy consumption for each of the benchmarks at different cachesizes.

using sparse instruction encoding the designs consume similaramounts of energy.

As one can observe, the huge difference in energy consump-tion between the static and dynamic versions is significantlydecreased when using the proposed stop-bit approach. Mostnotably, both processors consume approximately the sameamount of energy at larger cache sizes. It means that one cantake advantage of all the adaptability that the dynamic versionprovides, with limited additional costs in terms of energy.

V. CONCLUSION

In this paper, we extended the stop-bit technique for sparseinstruction encoding to a dynamically reconfigurable VLIWprocessor. We showed that, by implementing this technique,

TABLE IITHE RESOURCE USAGE ON THE FPGA FOR THE DYNAMIC CORE WITH

AND WITHOUT STOP-BIT IMPLEMENTATION.

Resource Original Stop-bit Increase

Registers 30153 30537 1.3%Luts 61927 62379 0.7%BRAMs 125 125 0.0%

IV. RESULTS

We evaluated four different versions of the processors: staticbaseline, static with stop-bit, dynamic baseline, and dynamicwith stop-bit — all of them in their 4-issue configurations.We use the 4-issue configurations to provide a fair comparisonbetween the static and dynamic cores. The difference betweendynamic and static versions is that the binaries for the formerare compiled as generic binaries. We considered instructioncache sizes ranging from 1KiB to 32KiB. These sizes werechosen so that at the largest cache size each of the programsfits in the instruction cache entirely.

The designs are implemented in VHDL and prototyped ona Xilinx Virtex 6 FPGA (ML605 Development board). Withthese prototypes, we use performance counters to determinethe number of cache accesses, misses, and the number ofrunning cycles. The cache stall time is 16 cycles per 4-bytebus access. We use the Cadence Encounter RTL Compilerto obtain power dissipation in ASIC (Application SpecificIntegrated Circuit), using a 65nm CMOS cell library fromSTMicroeletronics. The energy consumption of the memorysubsystem was calculated with the Cacti Tool [18].

We use applications from the Powerstone benchmarks [19].All sources are compiled with the HP VEX compiler [20]and assembled with either the ⇢-VEX port of GNU as, or ourmodified version of the st200 assembler. The dynamic stop-bit versions are assembled with alignment turned off, so thatinstruction bundles are not padded at all. Since the processorlacks floating point operations, we use the floatlib libraryincluded with the HP VEX compiler (based on BerkeleySoftFloat [21]).

A. FPGA Resource usageTable II shows the resource usage of the dynamic core on the

FPGA. It shows that the increase is only 1.3% for the numberof registers and 0.7% for the number of lookup tables. As wewill show in the following sections, with this small increasein area we achieve significant improvements in performance,energy, and code size.

B. Code Size Reduction and Instruction Cache Miss RateIn Table III, we show the reduction in code size for each of

the 14 benchmarks used. We can see that the average reductionis around 50%. The reductions for the dynamic core in 8-wayconfiguration are included for reference, and are even moreextreme. These reductions will impact the cache behavior. InFig. 4, we show the cache miss rates for the two differentcores with and without sparse instruction encoding. The results

TABLE IIITHE CODE SIZE REDUCTION FOR EACH OF THE BENCHMARKS.

Program code size reductionstatic dynamic dynamic

4-way core 4-way core 8-way core

adpcm 49% 48% 73%bcnt 35% 38% 64%blit 47% 45% 67%

compress 53% 51% 74%crc 48% 48% 71%des 42% 44% 68%

engine 57% 54% 77%fir 60% 54% 76%

g3fax 58% 55% 76%jpeg 53% 51% 73%

pocsag 55% 51% 74%qurt 67% 65% 82%

ucbqsort 57% 54% 76%v42 56% 53% 75%

average 53% 51% 73%

[Plot for Fig. 4: instruction cache miss rate (%), logarithmic scale from 0.02 to 64.00, versus cache size (1, 2, 4, 8, 16, 32 KiB); series: Dynamic Baseline, Static Baseline, Dynamic Stop-bit, Static Stop-bit.]

Fig. 4. Cache miss percentage for the dynamic and static cores, both with and without sparse instruction encoding, for different instruction cache sizes. The dots represent the individual benchmarks, whereas the lines represent the average miss percentage for a particular configuration.

In Fig. 4, we show the cache miss rates for the two different cores with and without sparse instruction encoding. The results show that both designs achieve a similar reduction in cache misses. In fact, with sparse instruction encoding the miss rates are similar to those of the canonical encoding with a cache almost four times as large. This might seem like a larger improvement than expected, since the code size was only reduced by half. However, because loops account for a majority of the executed instructions, a code size reduction that allows an entire loop body to fit into the cache will have a disproportionate impact on the cache miss rate.

C. Execution Time

Fig. 5 shows the speedup in execution time achieved for both the dynamic and static cores. We can see that for larger cache sizes, the execution time of some benchmarks is larger with sparse instruction encoding than without.

Fig. 5. Speedup for the stop-bit implementation for different instruction cache sizes. The lines represent the average speedup for a particular cache size.

This is because at those cache sizes the entire application fits in the cache, and the reduction in cache misses is offset by the penalty of having a longer branch delay. This could be remedied by inserting alignment NOPs to ensure that branch targets are always aligned, at the cost of an increase in cache misses.

In the same figure we also observe that for very small instruction cache sizes, the speedup is not as significant as it is for intermediate sizes. This is caused by the fact that the reduction in cache misses for intermediate cache sizes is far larger than for small cache sizes, as seen in Fig. 4.

Fig. 6 shows the normalized execution times for the dynamic core. The baseline is the average of the worst execution time of each application individually, executing on the dynamic baseline design with a cache size of 1 KiB. This figure shows, for instance, that for smaller cache sizes, the dynamic stop-bit implementation performs equivalently to the dynamic version without stop-bit with a cache between 2 and 4 times larger.

D. Energy Results

Fig. 7 presents the total energy consumed by each of the benchmarks. The lines show the geometric mean of all applications at each cache size. We can see that for small cache sizes, due to the additional hardware required to support reconfiguration, the dynamic core consumes more energy than the static core. However, at large cache sizes the different designs are closer together in terms of energy consumption.

Fig. 8 depicts the energy consumption of the dynamic core relative to that of the static core (values greater than 1 mean that the static version consumes less energy than the dynamic one). We can see that the baseline dynamic design consumes far more energy at small cache sizes, whereas when using sparse instruction encoding the designs consume similar amounts of energy.

Fig. 6. Normalized execution times for the dynamic core.

Fig. 7. Energy consumption for each of the benchmarks at different cache sizes.


As one can observe, the huge difference in energy consumption between the static and dynamic versions is significantly decreased when using the proposed stop-bit approach. Most notably, both processors consume approximately the same amount of energy at larger cache sizes. It means that one can take advantage of all the adaptability that the dynamic version provides, with limited additional costs in terms of energy.

V. CONCLUSION

In this paper, we extended the stop-bit technique for sparse instruction encoding to a dynamically reconfigurable VLIW processor. We showed that, by implementing this technique, significant improvements in performance, energy consumption, and code size can be achieved at an FPGA resource increase of only around 1%.


In short, each pair of datapaths (lane pair) of the dynamic core is able to function as a full separate core to support dynamic reconfiguration. Because of this, each lane pair is able to execute the full instruction set, in contrast to the static core, where each datapath is specialized. For this reason, in the case of the static core, instructions need to be forwarded to a datapath that is able to execute them. This requires additional dispatch circuitry that increases energy utilization (this is why the line that depicts the energy utilization of the static stop-bit design increases at larger cache sizes, crossing the other lines). In the dynamic core, this instruction dispersal is not necessary because each lane can execute each instruction. The locations of the functional units within lane pairs are coordinated with the assembler. The result is that the overhead of adding dynamic reconfigurability to the rVEX is reduced, as can be seen in Figure 9.

Figure 9: Difference in energy utilization between the dynamic and static rVEX cores. Without the new instruction encoding scheme, the dynamic core consumed up to a factor of 3.5 more power than a static core. By reusing some of the additional logic that is needed for dynamic reconfigurability for the new encoding scheme, this difference has been reduced.

3.3. Scalability

When evaluating the scalability of the platform, two factors need to be taken into consideration.

Firstly, the rVEX is a proof of concept that has not been taped out yet; a project in this direction is in its earliest stages. Therefore, the most appropriate area utilization results come from FPGA synthesis tools. ASIC synthesis tools have been used to generate estimations (these have been used to calculate the energy utilization figures), but until the chip has been taped out and verified working, these numbers will remain estimates.

Secondly, the rVEX processor is both run-time parametrizable and design-time configurable (using VHDL generics). Therefore, scalability is a metric in the design space when choosing the right parameters for the application. When creating a general-purpose platform that must be able to provide high performance for single threads and high throughput for multiple threads, the full dynamic 8-issue core can be used. However, this version will not provide a large degree of scalability (approximately 64 datapaths can fit on a Xilinx VC707 FPGA development board). On the other hand, if the workload is highly parallelizable and scalability is an important requirement, a static 2- or 4-issue core can be used. The area utilization is considerably lower, and single-thread performance can be sacrificed because the workload will rely on multithreading to achieve high performance. In this case, the reduction in area results in improved scalability (approximately 100 datapaths can fit on the same FPGA). In both cases, the memory hierarchy is not considered, as any processor's scalability will be impeded by the ability of the memory to provide bandwidth to increasing numbers of cores.

One of the short-term goals for 2016 at TU Delft is to design a platform where the memory structure can be configured at design time to match the memory access patterns of some of the ALMARVI image processing algorithms. This structure will include local memories whenever possible, so cores can stream data between computation stages without needing to access a shared bus. The expectation is that this memory structure will improve scalability considerably.


3.4. Usability

On a usability level, the rVEX platform has matured immensely since the start of the ALMARVI project. This section gives an overview of the different aspects.

Hardware

In 2015, the core was redesigned top-down with the following requirements (most of which were included with usability in mind):

• Precise trapping and interrupts
• Advanced debugging hardware and software
• Hardware tracing functionality and performance counters
• Core structure and pipeline organization (easily) configurable at design time
• (Easy) design-time extensibility (through instruction set extensions)
• Run-time parametrizable core (number of execution lanes)
• Dynamic cache that supports the varying number of execution lanes of the core

The design supports the ML605 and VC707 FPGA boards and can be synthesized using ISE or Vivado. There are versions with and without bus and peripherals (using the GRLIB VHDL library), with and without caches, and static and dynamic cores. All of these options are easily configurable at design time.

The peripherals from GRLIB that are used in the platform are the interrupt controller, timer, framebuffer, and DDR memory controller. The core is able to use either UART or the newly created PCI Express interface to connect to a host machine. This is depicted in Figure 10, which shows the system components including the additions that were necessary to support OpenCL via PCIe.

 Figure 10: Components added to the rVEX platform to support the PCI express interface

 

Interface, Debug support

Both interfaces are supported by a tool that is able to access the core for various debug purposes and advanced control of the core. The tool can access memory, the full state of the core (including the general-purpose register file and a wide array of control registers such as the program counter), and all of the debug functionality.


The platform supports many standard debugging features such as breakpoints/watchpoints, stepping, and register and memory readouts. Additionally, a program can be traced by hardware, where all relevant information of every execution cycle is reported to the host machine for analysis. These traces can be annotated with the disassembly of the program to monitor the full execution (instruction fetch, register reads, result writeback, cache hits/misses, etc.). These traces can, for example, be compared to the trace output of an architectural simulator. Lastly, the rVEX interface tool supports connections from GDB. We have added the rVEX architecture to GDB; it is available in our binutils-gdb port.
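As an illustration of the GDB route, a debug session could look as follows. This transcript is hypothetical: the executable name of the GDB build and the remote port are assumptions made for illustration, and only standard GDB remote-debugging commands are shown.

    $ rvex-gdb program.elf                  (hypothetical name of the GDB build)
    (gdb) target remote localhost:21079     (port exposed by the rVEX interface tool; assumed)
    (gdb) break main
    (gdb) continue
    (gdb) info registers
    (gdb) stepi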

Compilers, runtime, libraries

There are multiple compilers available for the rVEX platform. The standard compilers are HP VEX (a closed-source descendant of the Multiflow compiler), GCC, Open64 (open source, ported to st200 by STMicroelectronics, modified for VEX), LLVM, and CoSy.

There are currently two possible choices for the run-time environment (besides running bare-metal): uClinux with uClibc, and newlib with a compile-time generated filesystem. Newlib can be used to run large programs, such as the SPEC benchmark suite, as long as the input and output files and their sizes are known at compile time. uClinux (Linux for microcontrollers, a distribution of Linux with a no-MMU kernel) has a filesystem size limitation of 4 MiB because we are using a RAM disk.

Currently supported libraries are a basic math library included in newlib and a floating-point library. The latter has a decent performance of 23 cycles for a floating-point multiplication on a 4-issue rVEX core. However, the rVEX does not target floating-point workloads, and the image processing algorithms used within the context of the ALMARVI project will all be converted to fixed point before targeting the rVEX platform.
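To make the fixed-point route concrete, the fragment below shows one common way to replace fractional floating-point arithmetic on such a core, here using the Q15 format. This is a generic, minimal sketch rather than ALMARVI project code; the chosen format and the pixel-scaling example are assumptions made for illustration.

    #include <stdint.h>

    /* Q15 fixed point: a value x is stored as round(x * 2^15). */
    typedef int16_t q15_t;

    #define Q15_ONE (1 << 15)

    static inline q15_t q15_from_float(float x) {
        return (q15_t)(x * Q15_ONE + (x >= 0 ? 0.5f : -0.5f));
    }

    /* Multiply two Q15 numbers: the 32-bit product is in Q30 format,
     * so shift right by 15 to return to Q15. */
    static inline q15_t q15_mul(q15_t a, q15_t b) {
        return (q15_t)(((int32_t)a * (int32_t)b) >> 15);
    }

    /* Example: scale an 8-bit pixel by a fractional gain (e.g. 0.7)
     * using integer operations only, with saturation to 0..255. */
    static inline uint8_t scale_pixel(uint8_t p, q15_t gain) {
        int32_t r = ((int32_t)p * gain) >> 15;
        return (uint8_t)(r > 255 ? 255 : (r < 0 ? 0 : r));
    }

With this style of code, a multiplication costs a single integer multiply and shift instead of the 23-cycle floating-point library call mentioned above.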

OpenCL Support

Figure 11 shows a graphical representation of the software stack that supports OpenCL on the rVEX platform using pocl. The rVEX device layer has been added to the pocl project. It uses the newly developed rvex and xdma drivers to connect to the hardware. This setup can also be used on the ZYNQ platform (using AXI instead of PCIe), so it can be included in the demonstrator setup.

Figure 11: OpenCL support on the rVEX platform
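For reference, a minimal host-side program for this stack is sketched below. It uses only standard OpenCL 1.x API calls; the kernel source, buffer size, and work size are placeholders, and nothing rVEX-specific is assumed beyond the device being reachable through the normal OpenCL platform query (as pocl provides).

    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void scale(__global int *buf) {"
        "  size_t i = get_global_id(0);"
        "  buf[i] *= 2;"
        "}";

    int main(void) {
        cl_platform_id plat; cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        int data[256];
        for (int i = 0; i < 256; i++) data[i] = i;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

        size_t n = 256;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
        printf("data[1] = %d\n", data[1]);  /* prints 2 if the kernel ran */
        return 0;
    }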


4. UTIA platform

Performance improvements and scalability of a Full HD video processing algorithm will be presented on a motion detection algorithm.

The device works with a Full HD color video sensor and stores two subsequent video frames in memory. The algorithm performs edge detection on both frames using two Sobel filters. It computes the difference of the outputs of both filters to detect the moving edges. These moving edges are filtered by a median filter to remove noise. The filtered edges are marked in red and displayed together with the original content of the video frame on the Full HD HDMI display.

The algorithm is designed, debugged, and tested first on an ARM Cortex-A9 processor (666 MHz). The processor is capable of computing only about one frame per second, with maximal (-O3) optimization and the YCbCr format (16 bits per pixel) for the representation of data. The application requires the moving edge detection to be computed at the Full HD video frame rate of 60 FPS. This indicates the need to:

• Accelerate the computation 60 times
• Scale the computation over multiple boards to reach this requirement (60 FPS)
• Reach (if possible) the required 60 FPS with improved quality (the higher-precision RGB format, 24 bits per pixel)
• Reuse the software-defined (C/C++) description of the algorithm from the initial working (but slow) ARM Cortex-A9 implementation

To reach this goal we build on the concepts and tool chains for automatic generation of HW accelerators described in deliverable D3.2 for the UTIA platform. The tool chain used is briefly summarized now.

Generation of accelerators is based on automatic compilation of C/C++ functions by the Xilinx high-level synthesis (HLS) compiler 2015.4 into IP cores for the programmable logic of ZYNQ devices. The Xilinx SDSoC 2015.4 environment complements the HLS-generated IP cores with data movers serving for DMA transport of data from DDR3 to the programmable logic. This compilation can be done in the Xilinx SDSoC 2015.4 environment as described in D3.2 for the UTIA platform.
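To connect the algorithm description above to the C/C++ form that the HLS flow consumes, a simplified per-pixel formulation is sketched below. It is illustrative only: the function names, the 3x3 windows, and the use of a threshold are assumptions, not the UTIA production sources, and the median filtering stage is omitted.

    #include <stdint.h>
    #include <stdlib.h>

    /* Sobel gradient magnitude (|Gx| + |Gy|) of a 3x3 luma window. */
    static int sobel3x3(const uint8_t w[3][3]) {
        int gx = (w[0][2] + 2*w[1][2] + w[2][2]) - (w[0][0] + 2*w[1][0] + w[2][0]);
        int gy = (w[2][0] + 2*w[2][1] + w[2][2]) - (w[0][0] + 2*w[0][1] + w[0][2]);
        return abs(gx) + abs(gy);
    }

    /* One output pixel of the motion detector: the edge images of the
     * current and previous frame are compared, and a large difference
     * marks a moving edge (to be drawn in red in the output frame). */
    static int moving_edge(const uint8_t cur[3][3], const uint8_t prev[3][3],
                           int threshold) {
        return abs(sobel3x3(cur) - sobel3x3(prev)) > threshold;
    }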

4.1. Performance improvements

In the case of the Full HD motion detection algorithm we have to accelerate the 666 MHz ARM processor approximately 60 times using the programmable logic part (PL) of the Xilinx ZYNQ device. The PL can implement IP cores clocked at 150 MHz. It is clear that the implementation has to rely on the parallel processing capabilities of the HW and also on chaining of accelerators to achieve parallel, pipelined computation.

The Xilinx SDSoC (LLVM-based) compiler performs a source code transformation replacing pointer arguments of functions with new interfaces controlling the auto-generated data mover IPs. The data movers control the auto-generated HW DMA engines connected to the high-performance ports of the ARM Cortex-A9 processing subsystem of the ZYNQ device. The DMA access to the DDR3 is supported by the multiport DDR3 controller of the ZYNQ device.
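As a sketch of what this looks like at the source level, a function selected for acceleration can be annotated as below. The pragma follows the SDSoC style for declaring sequential (streaming) access on array arguments, but the function itself and the exact pragma set used in the UTIA design are assumptions made for illustration.

    #include <stdint.h>

    #define PIXELS (1920 * 1080)

    /* Candidate for HW acceleration: SDSoC replaces the array arguments
     * with interfaces driving auto-generated data movers (DMA engines). */
    #pragma SDS data access_pattern(in_frame:SEQUENTIAL, out_frame:SEQUENTIAL)
    void filter_pass(const uint16_t in_frame[PIXELS], uint16_t out_frame[PIXELS])
    {
        for (int i = 0; i < PIXELS; i++) {
            /* placeholder body; the real cores compute Sobel, difference,
             * and median stages over line buffers */
            out_frame[i] = in_frame[i];
        }
    }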


Figure 12: Concept of generating accelerated HW in the SDSoC environment for the Almarvi platform

Figure 12 explains how the Xilinx SDSoC 2015.4 HW platform is used in Almarvi. It is reproduced from the D3.2 deliverable for easier access here. The I/O interface IPs have to be prepared in advance by an HDL HW designer for the concrete HW system. In the case of the Almarvi Full HD video image sensor platforms, the "Interface IPs" transport data from the input Full HD image sensor to the input DDR3 video frame buffer, and also transport data from the output DDR3 video frame buffer to the Full HD HDMI output. See Figure 13. Details about the export of this platform to the Xilinx SDSoC environment are described in D3.2.

Figure 13: Almarvi vita-hdmio platform for the SDSoC environment

The platform described in Figure 13 decouples the Cortex-A9 ARM computation from the demanding Full HD resolution video data stream I/O. Software debugging can be done on the ARM while the real-time video input and output streams are processed by the vita-hdmio platform HW (see Figure 13). Figure 14 describes the motion detection algorithm as introduced in the initial section of this chapter.


Figure 14: Motion detection algorithm highlighting the moving edges in the Full HD video stream coming from the video sensor

4.2. Power/energy efficiency

Figure 15: Accelerated motion detection algorithm implemented by UTIA

Figure 15 presents the running edge detection algorithm implemented first on the ARM and next on the SDSoC-generated accelerator chain presented in Figure 14.

ARM (666 MHz), see top SW path in Figure 14: 1.17 FPS


SDSoC-generated accelerators (150 MHz), see HW path in Figure 14: 36.95 FPS. This is an acceleration of approximately 31 times (36.95 / 1.17 ≈ 31.6). The energy needed by the complete board to compute a Full HD motion detection frame is reduced 30 times. This is still not sufficient for the 60 FPS requirement.

4.3. Scalability

We have performed additional design exploration in the Xilinx SDSoC 2015.4 environment. We have found that the ZYNQ PL part can accommodate two parallel copies of the HW chain presented in Figure 14.

Two parallel chains of SDSoC-generated accelerators (150 MHz) deliver: 57.09 FPS. This is an acceleration of approximately 48 times. The energy needed by the complete board to compute a Full HD motion detection frame is reduced 45 times. This is still not sufficient for the 60 FPS requirement. The utilization of PL slices is close to 100%, and this holds for the 16 bits per pixel YCrCb data representation as defined by the Almarvi vita-hdmio platform shown in Figure 13. A solution with improved precision and the RGB 24 bits per pixel data representation would not fit in the device in the case of two parallel chains of SDSoC-generated accelerators (150 MHz).

To reach the required 60 FPS we have to scale up the computation to two boards. To reach this goal, we have created the Almarvi hdmii-hdmio platform for the second board. See Figure 16.

Figure 16: Almarvi hdmii-hdmio platform for the SDSoC environment, enabling serial scaling of the computation over a chain of boards communicating via the Full HD HDMI standard

We have also created two new Almarvi platforms, vita-rgb-hdmio (Figure 17) and hdmii-rgb-hdmio (Figure 18), supporting the RGB 24 bits per pixel format for the video data.


Figure 17: Almarvi vita-rgb-hdmio platform with extended RGB 24-bit precision

Figure 18: Almarvi hdmii-rgb-hdmio platform with extended RGB 24-bit precision for serial scaling of the computation over a chain of boards communicating via the Full HD HDMI standard

The motion detection algorithm is slightly modified for each board:

• The first board, with the video sensor, computes the upper 50% of each frame and copies the unmodified lower 50% of each frame. See Figure 19.
• The first board is capable of computing its upper 50% of each frame at the required 60 FPS.
• The second board receives Full HD frames from the first board at 60 FPS, with the upper 50% of each frame already done. The board copies this upper region without modification to the output frame. The board computes the lower 50% of each frame, also at the required 60 FPS. See Figure 20. (A sketch of this row-range split is given below.)
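A minimal sketch of this row split, assuming a hypothetical function process_rows() that runs the motion detection over a half-open row range (all names and the 16-bit pixel type are illustrative):

    #include <stdint.h>
    #include <string.h>

    #define W 1920
    #define H 1080

    /* Hypothetical board-specific worker, e.g. the accelerated detector. */
    void process_rows(const uint16_t *in, uint16_t *out, int begin, int end);

    /* Per-board frame handler: rows [row_begin, row_end) are processed,
     * the remaining rows are copied through unmodified. */
    void handle_frame(const uint16_t *in, uint16_t *out,
                      int row_begin, int row_end)
    {
        memcpy(out, in, (size_t)row_begin * W * sizeof(uint16_t));
        memcpy(out + (size_t)row_end * W, in + (size_t)row_end * W,
               (size_t)(H - row_end) * W * sizeof(uint16_t));
        process_rows(in, out, row_begin, row_end);
    }

The first board would call handle_frame(in, out, 0, H/2) and the second board handle_frame(in, out, H/2, H).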

 


Figure 19: Motion detection algorithm scaled for serially connected boards. Algorithm in the first (video sensor) board with the Almarvi vita-hdmio platform or the vita-rgb-hdmio platform (with extended precision)

Figure 20: Motion detection algorithm scaled for serially connected boards. Algorithm in the second board with the Almarvi hdmii-hdmio platform or the hdmii-rgb-hdmio platform (with extended precision)

4.4. Usability

We now have a scalable system which can reasonably be used. See Figure 21.

• It is a 2-board system connected serially via the Full HD HDMI standard.
• The scaled-up 2-board system has met the 60 FPS requirement for the Full HD motion detection algorithm.
• The boards have sufficient PL area reserve for both data formats, the compact YCrCb (16 bits per pixel) as well as the RGB (24 bits per pixel) data representation.
• The scaled system accelerates 60 times, as required.
• The energy needed by the scaled 2-board solution to compute a Full HD motion detection frame is reduced 30 times in comparison to the single-board SW solution on the ARM.
• The power consumption and cost of the scaled-up 2-board system have increased 2x in comparison to the single-board solution (single board 6 W; scaled-up 2-board system 12 W).
• The glass-to-glass latency (from the video sensor to the display) has increased 2x due to the pipelined processing.
• The scaled-up system remains manageable and can be supported by the automatic generation of accelerators as described in the Almarvi deliverable D3.2.


Figure 21: Motion detection algorithm scaled from a single board to two serially chained boards, meeting the Full HD resolution requirement and the 60 frames per second performance.

The flower movement is detected by the scaled-up algorithm working on the two serially connected ZYNQ boards. It is marked in the figure as the red edges of the moving parts of the flower. See Figure 21.

The following packages have been prepared by UTIA for the Xilinx SDSoC 2015.4 environment and the ZYNQ te0720-02-2IF system on module on the te701-05 carrier board (see Figure 21):

• vita-hdmio, with support for the Full HD (60 FPS) color Vita 2000 video sensor and Full HD (60 FPS) HDMI output, with internal video data representation as YCrCb (16 bits per pixel). It supports the Imageon FMC board.
• hdmii-hdmio, with support for Full HD (60 FPS) HDMI input, with internal video data representation as YCrCb (16 bits per pixel). It supports the Imageon FMC board.
• vita-rgb-hdmio, with support for the Full HD (60 FPS) color Vita 2000 video sensor and Full HD (60 FPS) HDMI output, with internal video data representation as RGB (24 bits per pixel). It supports the Imageon FMC board for the video sensor, the Digilent FMC board for HDMI input, and the te701-05 carrier HDMI output.
• hdmii-rgb-hdmio, with support for Full HD (60 FPS) HDMI input, with internal video data representation as RGB (24 bits per pixel). It supports the Digilent FMC board for HDMI input and the te701-05 carrier HDMI output.

The derived solution is usable and scalable with reasonable additional effort. See Figure 21.


5. Conclusions  

The ALMARVI execution platforms cover a wide spectrum of flexibility and customizability for the ALMARVI applications. This deliverable described the various quality aspects of the three ALMARVI-specific hardware platform configurations developed by Nokia, TUT and UTURKU; by TUDelft; and by UTIA. Four quality aspects have been discussed for each platform: performance improvements, power/energy efficiency, scalability, and usability. In the following, a number of these aspects are discussed for a couple of applications on the platforms to show the ALMARVI targets that have been achieved so far.

Nokia/TUT/UTURKU platform

With regard to performance, the processor core designed within the ALMARVI project (called LordCore) is based on the Transport Triggered Architecture (TTA) paradigm. To enable high-performance computation, it contains a 512-bit wide SIMD datapath, which is able to calculate 32 lanes of 16-bit wide half-precision floating-point values in parallel. The architecture is also scalable to other SIMD widths: 8- and 16-lane versions were also developed for lower performance usage such as LTE-M applications.

With regard to efficiency, the two-core version of the processor for the LTE receiver was synthesized with Synopsys Design Compiler version I-2013.12, using 28nm Fully Depleted Silicon-On-Insulator (FDSOI) process technology. The estimated power consumption for the two-core version is 137 mW (MMSE) and 163 mW (LORD), which is well below the targeted 1 W boundary. The extrapolated power consumption for a four-core version is about 270 mW, which would deliver the targeted LTE device category 11 performance of 600 Mbit/s. For the audio signal processor, the estimated power consumption of the two-core version is 300 µW, which is well below our targeted 1 mW upper boundary.

With regard to scalability, the architecture of the processor was designed to be scalable in various ways. Most notably, the memory architecture allows efficient scaling to multiple cores, as well as to multiple SIMD widths of the processor and numbers of functional units.

With regard to usability, the platform can be programmed using the OpenCL language for the LTE receiver. OpenCL allows using vector data types to execute the same code on many SIMD lanes of the processor, and also allows easy parallelization of the workload over multiple cores. The audio processor, on the other hand, can be programmed in the C language. C was selected due to the size of the application and the possibility for lower-level control than OpenCL.

TUDelft platform

By implementing a new VLIW instruction encoding, the performance of the rVEX processor has been increased by up to a factor of three while maintaining compatibility with the processor's dynamic parametrizability.

With regard to scalability, the rVEX processor is both run-time parametrizable and design-time configurable (using VHDL generics). Therefore, scalability is a metric in the design space when choosing the right parameters for the application.

On the usability level, the rVEX platform has matured immensely since the start of the ALMARVI project. In 2015, the core was redesigned with various improvements: 1) advanced debugging hardware/software, 2) hardware tracing functionality and performance counters, 3) core structure and pipeline organization are now (easily) design-time configurable, 4) the core is now (easily) design-time extensible, 5) the number of execution lanes is run-time parametrizable, and 6) a dynamic cache that supports the varying number of execution lanes of the core.

UTIA platform

With regard to performance, and to allow for real-time processing, the Full HD motion detection algorithm had to be accelerated approximately 60 times compared to the 666 MHz ARM processor, using the programmable logic part (PL) of the Xilinx ZYNQ device. The PL can implement IP cores clocked at 150 MHz.

UTIA performed design exploration in the Xilinx SDSoC 2015.4 environment. Two parallel chains of SDSoC-generated accelerators (at 150 MHz) deliver 57.09 FPS. This represents an acceleration of 48 times. The energy needed by the complete board to compute a Full HD motion detection frame was reduced 45 times.


6. References  

[1] T. Nylanden, J. Janhunen, O. Silven, and M. Juntti, "A GPU implementation for two MIMO-OFDM detectors," in Embedded Computer Systems (SAMOS), 2010 International Conference on, July 2010, pp. 293–300.

[2] M. Wu, B. Yin, and J. Cavallaro, "Flexible N-way MIMO detector on GPU," in Signal Processing Systems (SiPS), 2012 IEEE Workshop on, Oct 2012, pp. 318–323.

[3] M. Wu, Y. Sun, S. Gupta, and J. R. Cavallaro, "Implementation of a high throughput soft MIMO detector on GPU," J. Signal Process. Syst., vol. 64, no. 1, pp. 123–136, Jul. 2011. [Online]. Available: http://dx.doi.org/10.1007/s11265-010-0523-4

[4] A. Brandon, J. Hoozemans, J. van Straten, A. Lorenzon, A. Sartor, A. C. S. Beck, and S. Wong, "A sparse VLIW instruction encoding scheme compatible with generic binaries," in 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Dec 2015, pp. 1–7.