
Comparing Soft and Hard Vector Processing in FPGA-based Embedded Systems

Abstract—

Soh Jun Jie
School of Computer Engineering
Nanyang Technological University
50 Nanyang Avenue, S639798
Email: [email protected]

Nachiket Kapre
School of Computer Engineering
Nanyang Technological University
50 Nanyang Avenue, S639798
Email: [email protected]

Soft vector processors can augment and extend the capability of embedded hard vector processors in FPGA-based SoCs such as the Xilinx Zynq. We develop a compiler framework and an auto-tuning runtime that optimizes and executes data-parallel computation either on the scalar ARM processor, the embedded NEON engine or the Vectorblox MXP soft vector processor as appropriate. We consider computational conditions such as precision, vector length, chunk size and IO requirements under which soft vector processing can outperform scalar cores and hard vector blocks. Across a range of data-parallel benchmarks, we show that the MXP soft vector processor can outperform the NEON engine by up to 3.95× while saving 9% dynamic power (0.1W absolute). Our compilation and runtime framework is also able to outperform the gcc NEON vectorizer under certain conditions by explicit generation of NEON intrinsics and performance tuning of the auto-generated data-parallel code.

I. INTRODUCTION

Embedded computing SoCs (systems-on-chip) increasingly support a heterogeneous assembly of µarchitectures and co-processors organized around a central co-ordination processor. This co-ordination processor is typically a low-power, low-cost ARM or MIPS implementation that runs the host OS or firmware that drives the rest of the system. Depending on system requirements, we may pick SoCs that include vector processors (e.g. ARM NEON), graphics engines (e.g. NVidia Tegra, ARM Mali), custom analog blocks (e.g. Cypress PSoC), or FPGA logic (e.g. Xilinx Zynq and Altera SoC solutions). While this diversity brings freedom and versatility, designers of these embedded systems are expected to manually decide how to assign and optimize computational kernels for these heterogeneous SoC blocks.

The Xilinx ZedBoard platform, shown in Figure 1, has a Zynq SoC which includes the ARM Cortex A9-series CPU as the central co-ordination processor, augmented with the NEON SIMD hard vector engine under the same memory hierarchy, as well as a tightly-coupled FPGA logic substrate connected over a low-latency, high-bandwidth AXI and ACP interconnect. In this case, the designer has to decide whether to map data-parallel code to (1) simply run as scalar code on the ARMv7 CPU, (2) run as data-parallel code on the NEON hard vector engine, or (3) use FPGA logic.


[Figure 1: Xilinx Zynq SoC Platform Compute Organization. Hard silicon blocks: a 32b ARMv7 CPU (667MHz, 1.75kB RF plus caches) with the NEON SIMD engine (2kB vector RF). Soft programmable logic: the MXP soft vector processor (110MHz, 256kB vector scratchpad). Both sides reach off-chip DRAM over the Zynq SoC's AXI-HP interconnect at 880MB/s.]

The proximity and tight integration of the NEON engine offer a tempting option for effortless acceleration of many existing embedded applications through a simple recompile of the code. However, if power utilization and absolute performance are important, the FPGA logic may offer a superior alternative. To retain the fast development times and performance of mapping to NEON, we consider the Vectorblox MXP [12] soft vector processor as the alternative implementation. MXP is an FPGA-based soft vector processor designed to perform data-parallel tasks with high performance. It is able to deliver high performance due to parallel scratchpad accesses and support for fast DMA transfers. When using the MXP soft vector engine, developers can choose the number of vector lanes, scratchpad capacity, precision and other parameters to match application requirements. We show specifications of the NEON and MXP blocks in Table I. In this paper, we develop an automated framework to select between soft vector processors (programmed on FPGA logic) and embedded hard vector blocks for acceleration of data-parallel code in embedded SoC platforms. Our framework also generates optimized NEON code using auto-vectorization and explicit generation of NEON intrinsics if desired.

With this framework, we address the following questions: Under what conditions does the MXP soft vector processor offer superior performance and power benefits compared to NEON hard vector engines? How do we automate this decision process while generating optimized code for these systems? How do we optimize the configuration and operation of the soft vector processor without relying on manual tuning?

Consider the simple example a·x^2 + b·x + c implemented inside a data-parallel for loop operating on 8-bit data. When the loop trip count is low (<128 with warmed-up caches), we might as well avoid vectorization (hard or soft) and simply run the code as scalar instructions on the ARM CPU. For a larger trip count (>≈0.5M with uncached data), we can exploit automatic gcc vectorization for NEON (or NEON intrinsics) to achieve a 3.7× speedup. With NEON intrinsics we can achieve somewhat larger speedups in some instances compared to auto-vectorization. Finally, the MXP soft vector processor delivers faster runtimes at virtually all trip counts (≈1.3× faster than NEON). The result of these decisions changes with the benchmark, precision and internal state requirements of the computation. Our compiler and runtime system help the developer navigate this space of choices.
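To make the shape of this decision concrete, the fragment below is a minimal sketch of ours, not part of the published framework: pick_target() and its cut-offs are illustrative stand-ins for the tuner-derived decision described in Section III.

    #include <stddef.h>

    /* Illustrative backend selection using the thresholds quoted above:
       scalar ARM below ~128 elements; otherwise the MXP when a bitstream
       is loaded (it was fastest at virtually all trip counts in this
       example), falling back to the hard NEON engine when it is not. */
    typedef enum { TARGET_ARM_SCALAR, TARGET_NEON, TARGET_MXP } target_t;

    target_t pick_target(size_t trip_count, int mxp_loaded)
    {
        if (trip_count < 128)   /* too little work to amortize vector setup */
            return TARGET_ARM_SCALAR;
        if (!mxp_loaded)        /* no soft vector processor configured */
            return TARGET_NEON;
        return TARGET_MXP;      /* ~1.3x faster than NEON in this example */
    }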

The key contributions of this paper include:

• Development of a compiler backend that targets ARM NEON intrinsics, ARM NEON gcc and the MXP soft vector processor.

• Design of an auto-tuning framework that helps select between the ARM scalar core, the NEON hard vector engine and the MXP soft vector engine.

• Performance and power characterization of the different hardware configurations using the ZedBoard across a range of data-parallel kernels.

II. BACKGROUND

A. Zynq SoC Platform

In Table I, we show the key architecture specifications of the scalar ARM CPU, the NEON SIMD engine as well as the fully parallel, best-case configuration of the MXP soft processor on the Xilinx ZedBoard platform. As we can see, the 28nm hard silicon ARM Cortex A9 CPU and NEON SIMD engines are configured to run at 667MHz (-1 speed grade part), yielding 8b peak throughputs of 0.667 Gops/s and 3.9 Gops/s respectively. The 16-lane MXP processor can only reach 110 MHz but is still capable of an 8b peak throughput of 7.04 Gops/s. While the MXP can scale up to 128 lanes, the Zynq device on the ZedBoard limits us to 16 lanes. When mapping data-parallel computation to the NEON engine, we must load/store the vector operands into a constrained vector register file (32 x 64b), whereas the MXP processor permits loads/stores from a software-managed scratchpad of up to 256kB (again limited by the ZedBoard capacity). This disparity in register storage allows the MXP to support multiple intermediate live variables without incurring memory transfer penalties.

Table I: Architecture Specifications

    Metric            ARMv7 CPU      NEON [1]      MXP [12]
    µarch             Scalar         SIMD          Soft vector
    Clock Freq.       667 MHz        667 MHz       110 MHz
    Throughput
      32b (Gops/s)    0.6            1.3           1.7
      16b (Gops/s)    0.6            2.6           3.5
      8b  (Gops/s)    0.6            3.9           7
    Memory            1.75kB         2kB           4-256kB
                      (Scalar RF)    (Vector RF)   (Scratchpad)
    Lanes             1              2 x 32b       1-16 x 32b
                                     4 x 16b       2-32 x 16b
                                     8 x 8b        4-64 x 8b

Despite being a hard vector core, the NEON SIMD engine does permit a degree of programmability; we are allowed to switch between 8 lanes at 8b, 4 lanes at 16b and 2 lanes at 32b when using doubleword vectors. The programming challenge is to optimize code under these user-selectable metrics and architecture parameters.
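As a quick sanity check on the 8b peak-throughput rows of Table I (our own arithmetic, assuming one operation per byte-wide lane per cycle), the scalar and MXP entries follow directly from clock rate times lane count:

    \[
      0.667\,\mathrm{GHz} \times 1\ \text{lane} \approx 0.67\ \mathrm{Gops/s}\ \text{(ARMv7 scalar)},
      \qquad
      0.110\,\mathrm{GHz} \times 64\ \text{8b lanes} \approx 7.04\ \mathrm{Gops/s}\ \text{(16-lane MXP)}.
    \]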

B. Programming Vector Architectures

Programming the NEON SIMD engine can be as trivial as invoking gcc with the right compiler options. The gcc compiler has an auto-vectorizing backend that detects suitable for loops and converts them into NEON-compatible code. However, the compiler might miss several opportunities for parallelization, including register reuse in the small vector register file and potential memory transfer optimizations. Furthermore, when programming the MXP soft processor, the programmer may be unable to easily pick the optimal set of resource parameters (e.g. vector lanes, scratchpad size) that best match the design requirements. This may result in overprovisioned resources and sub-optimal performance.

To address these challenges, we select the open-source SCORE [2] compiler framework as the development environment for constructing our vectorizing backends. SCORE is a stream-oriented framework for reconfigurable execution that supports automated compilation of high-level parallel dataflow operators to low-level hardware. Computations described in SCORE obey the dataflow compute model. Originally developed for streaming circuit design, it has been extended to support a variety of other backends such as sequential control [6] and fixed-point circuit generation [9], [5]. In this paper, we adapt the compiler to support two extra backends for the NEON SIMD engine and the MXP soft processor.


Table II: Comparing code generation for the different backends

(a) SCORE

    poly (input int x, output int y)
    {
      state only(x):
        y = a*x*x + b*x + c;
    }

(b) ARMv7 C

    for (i = 0; i < N; i++) {
      y[i] = a*x[i]*x[i] + b*x[i] + c;
    }

(c) NEON intrinsics

    int8x8_t t1, t2, t3;
    t1 = vmul_s8(a, x);
    t2 = vmul_s8(t1, x);
    t3 = vmul_s8(b, x);
    t1 = vadd_s8(t2, t3);
    y  = vadd_s8(t1, c);

(d) MXP code

    vbx_dma_to_vector(x, x_host);
    vbx_byte_t *t1, *t2, *t3;
    vbx(SVB, VMUL, t1, a, x);
    vbx(VVB, VMUL, t2, t1, x);
    vbx(SVB, VMUL, t3, b, x);
    vbx(VVB, VADD, t1, t2, t3);
    vbx(SVB, VADD, y, c, t1);
    vbx_dma_from_vector(y, y_host);

III. VECTORIZING COMPILER AND AUTO-TUNING FRAMEWORK

To achieve the best performance and power characteristics from the embedded platform, system developers usually rely on manual mapping, tuning and optimization heuristics. As discussed earlier in Section I, the availability of heterogeneous compute elements further adds to the developer burden. In Figure 2, we show a block diagram of our compiler and runtime system that helps alleviate these developer costs.

[Figure 2: Block Diagram of the Compiler and Runtime Framework; the auto-tuner operates over vector length, precision, chunking and register use.]

The goal of our compiler framework is to generate optimized code for the three targets on the Zynq platform. This allows developers to describe their data-parallel kernels once, without manual porting effort. Furthermore, our compiler framework also supports optimizations such as straightforward register reuse and vector strip mining. The goal of the auto-tuning step is to (1) pick the best hardware target for a given problem configuration, and (2) perform runtime optimizations on data transfer mechanics. Our auto-tuner runs the optimized code under various configurations of bitwidth, lane count, vector length and datasize to identify the best backend target for our code. From our experimental results we also identify a simplistic, preliminary predictive model that can guide the mapping process.
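As a rough illustration of what the tuning loop does (a sketch under our own assumptions, not the authors' implementation), the harness below times a generated kernel at several candidate vector strip lengths and keeps the fastest; run_kernel_vl() is a hypothetical entry point standing in for the generated MXP or NEON code.

    #include <time.h>

    /* Hypothetical stand-in for the generated data-parallel kernel,
       specialized at runtime to a vector strip length `vl`. */
    extern void run_kernel_vl(int vl, int n);

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    /* Sweep candidate strip lengths, average a few runs of each, and
       return the fastest configuration for this problem size. */
    int best_vector_length(int n)
    {
        int best_vl = 64;
        double best_t = 1e30;
        for (int vl = 64; vl <= 4096; vl *= 2) {
            double t0 = seconds();
            for (int rep = 0; rep < 8; rep++)
                run_kernel_vl(vl, n);
            double t = (seconds() - t0) / 8.0;
            if (t < best_t) { best_t = t; best_vl = vl; }
        }
        return best_vl;
    }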

In Table II, we see the input SCORE program and the generated backend code for the simple a·x^2 + b·x + c example. The SCORE program is a stateless, data-parallel operation that consumes inputs x to generate outputs y. As SCORE was originally designed to generate FPGA circuits, this code is converted to streaming, pipelined Verilog where a new set of values is injected into the circuit in each cycle. For the scalar C backend, we perform a simple source-to-source translation and wrap the code in a for loop of user-specified length. For the NEON backend, we first attempt gcc auto-vectorization of the scalar code by simply using an appropriate build setup. When we generate NEON intrinsics directly, we use a dataflow expression rewriting engine that parallelizes the dataflow expressions and runs a register reuse optimization pass to conserve the vector registers utilized by our code. Here, an intrinsic is an inline-assembly function used in the generated C code that directly calls a documented ARM NEON instruction at a specific bitwidth and lane-count combination, thereby overriding the gcc compiler. This can be useful in cases where the compiler may not correctly detect NEON optimization opportunities. This may be seen in the way temporary variable t1 gets used twice in the code block.
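For reference, a single explicitly generated intrinsic looks like the following (a minimal standalone example of ours, mirroring one step of column (c) in Table II); vmul_s8 operates on 8x8b doubleword vectors and is lowered to one VMUL.I8 instruction:

    #include <arm_neon.h>

    /* One lane-parallel step of the poly kernel: multiply two 8x8b
       vectors element by element, as the generated intrinsic code in
       Table II does with t1 = vmul_s8(a, x). */
    int8x8_t scale8(int8x8_t a, int8x8_t x)
    {
        return vmul_s8(a, x);   /* 8 signed 8-bit multiplies in one instruction */
    }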

We use a minor modification of the same backend to also generate optimized MXP code. In addition to these simple dataflow code-generation passes, we wrap the code blocks with double-buffered loads and stores to allow overlapped evaluation of computation and communication phases (not shown in Table II for simplicity).
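The listing below sketches what such a double-buffered wrapper might look like for the poly kernel (our reconstruction, not the generated code itself); it assumes the vbx()/vbx_dma_* calls from Table II plus vbx_set_vl(), vbx_sp_malloc(), vbx_sp_free() and vbx_sync() from the MXP API, whose exact signatures may differ, and it ignores the tail case where n is not a multiple of the strip length.

    #include "vbx.h"   /* MXP API header; name assumed */

    /* Double-buffered, strip-mined poly kernel on the MXP scratchpad:
       while chunk k is being computed, chunk k+1 is DMA-ed in. */
    void poly_mxp(vbx_byte_t *x_host, vbx_byte_t *y_host, int n, int vl,
                  int a, int b, int c)
    {
        vbx_byte_t *x[2], *y[2], *t1, *t2;
        for (int k = 0; k < 2; k++) {            /* ping-pong buffers */
            x[k] = vbx_sp_malloc(vl);
            y[k] = vbx_sp_malloc(vl);
        }
        t1 = vbx_sp_malloc(vl);
        t2 = vbx_sp_malloc(vl);

        vbx_dma_to_vector(x[0], x_host, vl);     /* prefetch first chunk */
        for (int i = 0, k = 0; i < n; i += vl, k ^= 1) {
            if (i + vl < n)                      /* overlap the next transfer */
                vbx_dma_to_vector(x[k ^ 1], x_host + i + vl, vl);

            vbx_set_vl(vl);
            vbx(SVB, VMUL, t1, a, x[k]);         /* t1 = a*x           */
            vbx(VVB, VMUL, t2, t1, x[k]);        /* t2 = a*x*x         */
            vbx(SVB, VMUL, t1, b, x[k]);         /* t1 = b*x           */
            vbx(VVB, VADD, t1, t2, t1);          /* t1 = a*x*x + b*x   */
            vbx(SVB, VADD, y[k], c, t1);         /* y  = t1 + c        */

            vbx_dma_from_vector(y_host + i, y[k], vl);
        }
        vbx_sync();                              /* drain outstanding DMA/compute */
        vbx_sp_free();                           /* release scratchpad buffers */
    }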

IV. METHODOLOGY

We develop an experimental methodology to compare the different vector backends and architectures based on a range of benchmark problems. In Table III, we show our set of benchmark problems, derived from EEMBC kernels, SCORE kernels and certain OpenCV functions. These benchmarks are characterized by variations in compute (arithmetic), intermediate storage requirements, IO requirements, and precision. We tested the data-parallel problems on a range of trip counts up to 2^14 elements.


Table III: Benchmark Properties

    Name       Description               Ops.   Reg.   IOs
    EEMBC-derived
    rgb2gr     rgb filter                 4      3      3
    rgb2lm     rgb filter                 7      6      4
    OpenCV-derived
    tap4flt    fir filter                 8      6      9
    FX-SCORE-derived
    vecadd     vector addition            1             3
    poly       degree-2 polynomial        3      2      3
    matmul     2x2 matrix multiply       12      8     12
    dotprod    4-elem dot product         3      2      5
    l1lin      transistor linear          7     18      4
    l1sat      transistor saturation      8     19      4

We use the Xilinx ZedBoard for our experiments. We use the Xillinux OS v1.1 (Ubuntu) when executing scalar ARMv7 and NEON code. We use gcc v4.4 for generating ARM binaries and exploit auto-vectorization for a portion of our experimental flow. To compile NEON code using the automatic vectorization pass, we use the -ftree-vectorize switch in conjunction with the -mfpu=neon option. We use <arm_neon.h> v4.4 when directly generating NEON intrinsics from the SCORE compiler. In this case, we simply supply the -mfpu=neon option. For generating bitstreams for the ZedBoard for the various MXP experiments, we use Xilinx ISE 14.7.

For timing measurements, we use ARM hardware timers for extracting runtime information of the ARM, NEON and MXP binaries. For the MXP soft processor, we use the MXP timing API. To ensure statistical significance of the measured data, we consider averaged values measured across several timing runs to minimize the effect of measurement noise. This also accounts for the impact of caching to ensure we perform fair comparisons across the different hardware backends.

We measure power usage of the ZedBoard using an Energenie power meter (2% accuracy). We record power measurements after reaching steady state on the vectorized code execution. We measure steady-state power utilization for the scalar and NEON implementations when running Xillinux from the SD card. The MXP implementations are programmed and executed over the PROGUSB cable connected to a host PC, which adds a non-trivial power overhead. We recorded a stable 5.1W steady-state power utilization when running Xillinux, which rose to 5.5W with the PROGUSB cable connected. We calculate runtime power utilization from these two baselines for each experiment as appropriate.

V. RESULTS

Speedups and Runtime Comparisons: We first compare the speedups achieved by the 16-lane MXP soft processor against optimized NEON vector runtimes in Figure 3.

[Figure 3: Speedups of MXP over the best NEON implementation for each application kernel; bars range from roughly 2× (1.96, 2.01, 2.07) up to 3.95×.]

[Figure 4: Comparing runtime of the different application kernels under ARMv7 scalar, NEON gcc, NEON intrinsics, MXP (no inline) and MXP mappings.]

We observe that in most cases the MXP outperforms the NEON implementations, by as much as 3.95× at 8b computations, dropping to 2.07× at 32b computations. We would expect this drop since NEON has eight 8b lanes but only two 32b lanes (see Table I). These speedups for the MXP are primarily due to the superior double-buffered memory transfer optimizations made possible by the fast AXI-FPGA interconnect. We note that certain application kernels like vecadd show substantial speedups due to simple data-parallelism with few intermediate states. In contrast, kernels such as l1lin and tap4flt are unable to beat NEON speeds due to low IO-to-compute ratios (see Table III).

In addition to MXP and NEON performance, we are also interested in understanding the quality of the code generated by our compiler framework. The gcc compiler is already capable of generating vectorized code for the NEON backend. However, in conjunction with explicit intrinsic generation, dynamic auto-tuning and register allocation passes, our framework offers the ability to improve NEON performance. For the MXP soft processor, we use the inlining compiler optimization to lower overheads and tune lane count and vector strip length as appropriate. To illustrate this, in Figure 4, we show the runtime comparison between the ARMv7 scalar CPU, NEON SIMD evaluation and MXP soft processor mappings for 32b computations. We observe that inlining allows MXP code to run marginally faster due to lower function call overheads. When using explicitly generated intrinsics for the NEON, we see a mix of improvements and slowdowns across the benchmark set, suggesting additional optimizations may be necessary to comprehensively beat the gcc auto-vectorizer. Additionally, for certain cases like matmul and poly, we do not observe benefits for NEON SIMD execution over the scalar CPU due to simplistic calculations and memory bottlenecks.

[Figure 5: MXP Auto-Tuning Dynamic Range across the application kernels.]

[Figure 6: NEON Auto-Tuning Dynamic Range across the application kernels.]

Need for Auto-Tuning: To understand the importance of auto-tuning on user code, we represent the space of measured runtimes on the MXP soft vector processor in Figure 5. As we can see, the dynamic range of possible combinations can be as high as four orders of magnitude. This large range suggests a critical need to pick the best parameters. The choice of parameters also affects resource costs (not shown) but, for our experiments, we are already tightly constrained by the 16-lane limit imposed by the ZedBoard FPGA capacity. Being a hard vector architecture with few programmable elements, the NEON has a much lower auto-tuning range, as shown in Figure 6. However, the order-of-magnitude range is still large enough to merit a tuner-assisted optimization.

Tuning MXP Performance: When designing soft vector processors, we have the unique capability of selecting the lane configuration that delivers the best performance while keeping resource utilization low. For small datasets, we would expect fewer parallel lanes to offer better value for resource consumption, as we will be unable to fully saturate all lanes. Under bandwidth-limited circumstances, excessively large lane counts would be equally wasteful of resources. Most applications we characterized preferred a lane count of 8 or 16, given sufficient parallel work. As shown in Figure 7, our auto-tuner identifies the poly benchmark as the most scalable design, with continuing improvements at 16 lanes, while the vecadd benchmark saturates quickly at 2 lanes due to the accumulation loopback limit. The rest of the benchmarks show saturated speedups above 8 lanes.

[Figure 7: Tuning Lane Width for the MXP soft processor (16b data and VL of 1024); time per element versus vector lanes L = 1, 2, 4, 8, 16 for each kernel.]

[Figure 8: Tuning Vector Length for MXP (16b data and 16-lane design); time per element versus vector length VL = 1 to 4096.]

An additional tunable feature of the MXP soft processor is our ability to choose the vector strip length. The MXP eschews the vector register file in favor of a user-managed scratchpad. In certain cases, we can achieve performance parity with a larger lane-count design simply by choosing the vector length that best maps to a given scratchpad size. As on-chip memory resources can be substantially denser than logic, this optimization can offer a cheaper route to improving vector code performance than adding expensive additional lanes. In Figure 8, we show the impact of varying the vector length on performance. For our benchmarks, the largest achievable vector length varies between 1024-4096 due to variations in IO capacity and internal state requirements (see Table III). However, performance of most benchmarks saturates above a vector length of 1024 in any case.

Different data-parallel workloads require the vectorizable loop body to run for varying trip counts. In Figure 9, we observe the impact of the overall trip count on performance.

[Figure 9: Impact of Total Element Count (16b data, 16-lane design); time per element versus datasize N = 2^9 to 2^13 for each kernel.]

Table IV: Power Consumption Measurements

    Configuration     Power    Δ       (% over Idle)
    Idle              5.1W     0        0
    ARMv7 Scalar      5.9W     0.8W    15
    NEON Vector       5.7W     0.6W    11
    Idle+PROGUSB      5.5W     0        0
    MXP Vector        5.6W     0.1W     2

We note peculiar scaling behavior for certain workloads where performance stays stable even at larger datasizes. These workloads are those with high arithmetic intensity, i.e. fewer IO loads/stores and register requirements per vector computation. For workloads with larger IO and intermediate register needs, runtime is dominated by memory bottlenecks.

Power Usage: Finally, in Table IV, we show the overall power consumption of the ZedBoard used in our experiments. We note that the use of the ARMv7 CPU or NEON for running our benchmarks increases power consumption from an idle consumption of 5.1W to 5.9W and 5.7W respectively (mean). This indicates an improvement of 0.2W when accelerating code on the NEON engine. When using MXP, the power consumption increases from an idle consumption of 5.5W (due to the PROGUSB cable requirements) to 5.6W. This represents a modest increase of 0.1W in power usage when running code on the FPGA-based soft vector processor. Even in absolute terms, this is still a 0.1W improvement over NEON execution.

VI. DISCUSSION

From our experiments we can draw the higher-level conclusion that soft vector processors such as the MXP architecture are capable of matching and exceeding the performance of hard vector processors such as NEON in FPGA-based heterogeneous embedded systems. For simple data-parallel computations with low compute requirements (poly), vector performance scales with increasing lane counts. For computations with accumulation (matmul, tap4flt), we observe limited scalability due to the feedback loop. An optimized reduction primitive would be a useful addition to the vector architecture for such computations. As expected, 8b vectors permit larger lane counts and consequently higher speedups.
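To make the missing primitive concrete, the snippet below is our own illustration (using NEON intrinsics rather than the MXP ISA) of a log-depth pairwise reduction over one 128b vector of 16b lanes; an analogous reduction instruction on the soft vector side would relieve the accumulation loopback that limits matmul and dotprod scaling.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Log-depth pairwise reduction: 8x16b lanes -> 4x32b -> 2x64b -> scalar. */
    int32_t reduce_sum_s16(int16x8_t v)
    {
        int32x4_t s = vpaddlq_s16(v);   /* pairwise widen-and-add to 4x32b */
        int64x2_t t = vpaddlq_s32(s);   /* pairwise widen-and-add to 2x64b */
        return (int32_t)(vgetq_lane_s64(t, 0) + vgetq_lane_s64(t, 1));
    }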

VII. CONTEXT

Many soft vector processors [3], [11], [14], [13], [7] and their building blocks have been reported in the literature. Each of these studies represents an increasingly refined design of FPGA-based vector units for improved performance and reduced resource utilization. VESPA [13] introduced the idea of soft processors that use VIRAM-like vector processing engines on FPGA fabrics while showing speedups over ordinary scalar soft processors. VIPERS [14] provided a NIOS host to the soft vector engine and demonstrated the importance of memory scaling on performance. The improved VEGAS [3] soft vector processor added local scratchpads and demonstrated comparable performance against Intel SSE SIMD execution for the integer matmul benchmark (≈2× faster). VENICE [11] is a resource-optimized VEGAS that shows 10-200× speedups against the Nios II/f processor embedded in soft logic on Altera FPGA platforms such as the DE2-115 and DE4 boards. BlueVec [10] is a Bluespec-based vector processor developed for applications that are constrained by the FPGA memory wall. While VIPERS, VESPA, VEGAS, VENICE and BlueVec operate on integer data, the vector architecture presented in [7] works with floating-point operands. For our experiments, we use the MXP [12] soft vector processor, which supports integer and fixed-point calculations only. While existing vector architectures offer superior performance when using FPGA logic, the performance comparisons have always been against other soft processors with poor performance, e.g. MicroBlaze, NIOS, other lightweight soft processors or 1-lane self-comparisons. In this study, we compare the MXP soft vector engine against the 667MHz hard silicon ARM Cortex A9 processor along with its NEON SIMD vector engine, which is of direct relevance in embedded system environments. Furthermore, the ARMv7 CPU and NEON engines offer substantially higher throughput than the NIOS, MicroBlaze and other soft processors used in earlier studies.

For a long time, soft vector architectures lacked a high-level programming model that makes it easy to target these platforms. While vectorizing compilers are not new (even gcc supports auto-vectorization), it is not trivial to adapt these existing backends to support newer, custom vector architectures. Hence, earlier versions of soft vector processors were programmed directly through low-level API calls to the instruction handling logic. The Accelerator [8] compiler showed how to auto-generate VENICE-compatible code from high-level descriptions of parallel programs. The SCORE framework is a similar high-level programming environment for data-parallel stream computations.


Unlike Accelerator, we develop an auto-tuning pass that helps choose the best soft vector configuration (number of lanes, chunk size, computation-communication overlap) without programmer involvement. The custom vector compiler [4] extracts custom vector instructions to generate CVUs (custom vector units) based on frequently occurring patterns in data-parallel programs. Our emphasis in this paper is on the simpler RISC-like vector architectures observed in NEON and the MXP processor, and an associated compiler framework to target these architectures.

VIII. FUTURE WORK

At present, we rely on gcc auto-vectorization but intend to investigate the potential of using armcc for generating better vector code that may close the gap with SCORE-generated code. As part of future work, we will also consider partitioning computation across all three backends (scalar, NEON SIMD and MXP soft vector) together. Furthermore, larger Zynq SoCs will permit MXP configurations with more than 16 lanes and a larger scratchpad. While MXP currently uses a single 64b AXI-HP port, the switch to ACP and the use of multiple AXI-HP ports will offer further improvements in DMA throughput. We can also consider larger Zynq devices (than the Z7020 part on the ZedBoard) to support larger MXP lane counts and consequently higher speedups for vectorizable computations with low IO requirements.

IX. CONCLUSIONS

Heterogeneous FPGA-based SoCs like the Xilinx Zynq offer a novel computing platform for embedded processing that allows us to unify a scalar ARMv7 core, hard vector NEON SIMD engines as well as the FPGA-based MXP soft vector processor on the same chip. Using our compiler and auto-tuning framework, we demonstrate speedups as high as 3.95× for offloaded data-parallel computation when comparing an MXP soft vector processor to optimized NEON mappings across a range of embedded data-parallel kernels. MXP is able to outperform NEON due to better customizability of vector lane counts, scratchpad size and memory transfer optimizations. In addition to performance improvements, we also measure ≈9% power savings when choosing FPGA-based vector acceleration over NEON. On larger FPGA-based SoCs with the already-available faster interconnect options, we expect MXP code to provide scalable performance when compared to the hard silicon NEON engines. We will be making the benchmark codes and the compiler frameworks open-source for community use.

REFERENCES

[1] ARM Ltd. Cortex-A9 NEON Media Processing Engine. pages 1-49, July 2011.

[2] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek, and A. DeHon. Stream computations organized for reconfigurable execution (SCORE). Field-Programmable Logic and Applications: The Roadmap to Reconfigurable Computing, 2000.

[3] C. H. Chou, A. Severance, A. D. Brant, Z. Liu, S. Sant, and G. G. Lemieux. VEGAS: Soft vector processor with scratchpad memory. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 15-24. ACM, 2011.

[4] J. Cong, M. A. Ghodrat, M. Gill, H. Huang, B. Liu, R. Prabhakar, G. Reinman, and M. Vitanza. Compilation and architecture support for customized vector instruction extension. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pages 652-657. IEEE, 2012.

[5] Y. Deheng and N. Kapre. MixFX-SCORE: Heterogeneous Fixed-Point Compilation of Dataflow Computations. pages 1-4, Mar. 2014.

[6] N. Kapre and A. DeHon. VLIW-SCORE: Beyond C for Sequential Control of SPICE FPGA Acceleration. In Field-Programmable Technology (FPT), 2011 International Conference on, pages 1-9, Dec. 2011.

[7] J. Kathiara and M. Leeser. An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs. In Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, pages 33-36. IEEE, 2011.

[8] Z. Liu, A. Severance, S. Singh, and G. G. Lemieux. Accelerator compiler for the VENICE vector processor. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 229-232. ACM, 2012.

[9] H. Martorell and N. Kapre. FX-SCORE: A Framework for Fixed-Point Compilation of SPICE Device Models using Gappa++. In IEEE International Symposium on Field-Programmable Custom Computing Machines, pages 77-84, Mar. 2012.

[10] M. Naylor, P. J. Fox, A. T. Markettos, and S. W. Moore. Managing the FPGA memory wall: Custom computing or vector processing? In Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pages 1-6. IEEE, 2013.

[11] A. Severance and G. Lemieux. VENICE: A compact vector processor for FPGA applications. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pages 245-245. IEEE, 2012.

[12] A. Severance and G. G. Lemieux. Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor. In Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2013 International Conference on, pages 1-10. IEEE, 2013.

[13] P. Yiannacouras, J. G. Steffan, and J. Rose. VESPA: portable, scalable, and flexible FPGA-based vector processors. In Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pages 61-70. ACM, 2008.

[14] J. Yu, C. Eagleston, C. H.-Y. Chou, M. Perreault, and G. Lemieux. Vector processing as a soft processor accelerator. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2(2):12, 2009.