
ALMARVI "Algorithms, Design Methods, and Many-Core Execution Platform for Low-Power Massive Data-Rate Video and Image Processing"

Project co-funded by the ARTEMIS Joint Undertaking under the ASP 5: Computing Platforms for Embedded Systems
ARTEMIS JU Grant Agreement no. 621439

Scalability, quality and usability of the execution platform (D3.5)

Due date of deliverable: March 31, 2016
Start date of project: 1 April 2014        Duration: 36 months
Organisation name of lead contractor for this deliverable: TU Delft
Author(s): Zaid Al-Ars (TUDelft)
Validated by: Joost Hoozemans (TUDelft)
Version number: 1.0
Submission Date: March 31, 2016
Doc reference: ALMARVI_D3.5_final_v10
Work Pack./Task: WP 3, Task 3.1
Description (max 5 lines): This report presents the quality aspects of the hardware configurations w.r.t. performance improvements, power/energy efficiency, scalability, and usability of the execution platform.
Nature: R = Report
Dissemination Level: CO (Confidential, only for members of the consortium, including the JU)


DOCUMENT HISTORY

Release  Date        Reason of change                                   Status  Distribution
V0.1     14/12/2015  Initial document organization                      draft   CO
V0.2     11/2/2016   Contributions from TUD-UTIA-NOK-TUT-UTURKU added   draft   CO
V0.3     13/3/2016   Merged contribution of partners                    draft   CO
V0.4     28/3/2016   Introduction and conclusions updated               draft   CO
V1.0     31/3/2016   Submitted to Artemis                               final   CO


Contents

1. Introduction
2. Nokia/TUT/UTURKU platform
   2.1. Performance improvements
   2.2. Power/energy efficiency
   2.3. Scalability
   2.4. Usability
3. TUDelft platform
   3.1. Performance improvements
   3.2. Power/energy efficiency
   3.3. Scalability
   3.4. Usability
4. UTIA platform
   4.1. Performance improvements
   4.2. Power/energy efficiency
   4.3. Scalability
   4.4. Usability
5. Conclusions
6. References


1. Introduction  

This report represents deliverable D3.5, which is part of Task 3.1 in WP3 of the ALMARVI project. D3.5 presents the quality aspects of the hardware configurations w.r.t. performance improvements, power/energy efficiency, scalability, and usability of the execution platform. This deliverable builds on D3.1, where the hardware platform solutions for image/video processing were initially described.

D3.5 assesses the quality of the appropriate architecture configuration, defined by the number and types of cores and the specialized image-processing instructions that may be reconfigured on FPGAs. Reconfiguration, adaptability, and scalability of the hardware configuration are important issues given the cross-domain nature, the variability in acceleration fabrics, and the image/video workloads. Therefore, special attention is paid to methods for adaptability, enabling the exchange of processing between the different elements in the configuration, in order to optimally use the properties of the hardware at hand. This deliverable also describes the integration of heterogeneous acceleration fabrics, interconnects, and protocols for the platforms, such that the interfaces between the different kinds of hardware can be used in the most effective way. Configuration choices are application- and domain-specific, taking into account quality issues such as energy, quality of service, and throughput, while allowing for massive real-time low-power data processing.

The partners involved in delivering ALMARVI execution platforms are NOK, TUT, UTURKU, TUDelft and UTIA. Section 2 describes the execution platform developed by NOK, TUT and UTURKU. Sections 3 and 4 describe the platforms of TUDelft and UTIA, respectively.

[Figure: Position of D3.5 in the context of the ALMARVI project within WP3, indicating the contribution of D3.5.]


2. Nokia/TUT/UTURKU  platform  

As agreed in the plan for the ALMARVI common system software stack, we have adopted OpenCL as the parallel programming language for the heterogeneous architecture. OpenCL allows the programmer to describe program parallelism by expressing the computation in the Single Program Multiple Data (SPMD) style. In this style, multiple parallel work-items execute the same kernel function in parallel, with synchronization expressed explicitly by the programmer. Another key concept in OpenCL is the work-group, which collects a set of coupled work-items that may synchronize with each other. Across multiple work-groups executing the same kernel, however, there can be no data dependencies. These concepts allow exploiting parallelism at multiple levels for a single kernel description: inside a work-item, across work-items in a single work-group, and across all the work-groups in the work-space.

At the highest level, OpenCL command queues are a means to describe task-level parallelism and to map the execution of a larger multi-kernel application onto a heterogeneous system with various device types. This level will be evaluated in the Zynq demonstrator, which supports both tailored devices implemented in the FPGA fabric and an ARM CPU device. In this setup, the whole application can be functionally verified by writing an OpenCL host application that is executed on the ARM host or in any OpenCL-supported desktop environment. For the OpenCL implementation, the pocl project is used as a basis. This open-source OpenCL implementation is ported to the Zynq platform in such a way that the whole demonstration setup can be controlled by means of a single pocl OpenCL context.

Customized processors provide a middle ground between fixed-function accelerators and generic programmable cores. They bring the benefits of hardware tailoring to programmable designs, while adding new advantages such as reduced implementation verification effort. The hardware of a customized processor is optimized for executing a predefined set of applications, while allowing the very same design to run other, sufficiently similar routines by switching the executed software in the instruction memory. The degree of processor hardware tailoring is dictated by the use case and the targeted product.

In any case, the processor customization process is highly demanding and error-prone, with high non-recurring engineering costs. Moreover, as the design process of customized processors is usually iterative in nature, porting the required software to new processor variations needs either assembly language rewrites or retargeting the compiler. One approach to simplifying the processor customization process is to compose the processor from a set of component libraries and other verified building blocks, thereby reducing the required verification effort.

The software porting problem can be alleviated with automatically retargeted software development kits. For this purpose we use the TTA-Based Co-Design Environment (TCE), a processor design and programming toolset based on a processor template that supports different styles of parallelism efficiently. TCE enables rapid design of cores ranging from tiny scalar microcontrollers to multicore vector machines, with a resource-oriented design methodology that emphasizes reuse of components.

We have selected a few key use cases to challenge the TCE and pocl toolsets to produce application-specific designs that meet the requirements; these are listed after the kernel sketch below.
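The following minimal OpenCL C kernel illustrates the work-item and work-group concepts introduced at the start of this section (a generic sketch for illustration, not ALMARVI project code): each work-item processes one element, work-items within a work-group may synchronize through a barrier, and no data dependencies may exist across work-groups.

    /* Every work-item runs the same kernel body (SPMD). */
    __kernel void scale_rows(__global const float *in,
                             __global float *out,
                             __local float *stage)     /* work-group local */
    {
        size_t gid = get_global_id(0);   /* unique across the work-space  */
        size_t lid = get_local_id(0);    /* index within this work-group  */

        stage[lid] = in[gid];            /* stage data in local memory    */
        barrier(CLK_LOCAL_MEM_FENCE);    /* synchronize the work-group    */

        /* Work-groups are independent: no data dependencies across them. */
        out[gid] = 2.0f * stage[lid];
    }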

1) 4G LTE is a standard for high-speed, low-latency wide-area cellular data communications, and builds upon the technologies developed by the 3GPP project. The most demanding and compute-intensive algorithms in a modern radio receiver relate to signal detection and demodulation. The MIMO technique employs multiple transmitter and receiver antennas to transmit multiple parallel data streams. As the system employs M x N different paths for the signal, higher diversity, more reliable communication, and higher throughput are achieved. The price to be paid is higher receiver complexity, which increases exponentially with the MIMO and modulation order. LTE provides several device categories, up to a 600 Mbit/s downlink data rate with 4 MIMO layers. The design must be scalable so that different device categories can be supported with simple parameterization of the architectural template, rapidly producing processor designs that trade off area and performance. In addition, at run-time, a more robust demodulation algorithm can be selected at the expense of demodulation speed, depending on the channel conditions. We have set the power budget for the designed processor to less than 1 W, in order for the design to be suitable for mobile handset use scenarios.

2) In the second co-processor design we perform audio signal processing in a wearable, always-on type of device. The processor implements audio signal processing algorithms such as IIR biquads, linear filters, spectral analysis, and adaptive filters. The input sample rate of such systems is quite modest compared to wideband radio transceivers, but the use case requires extremely low energy and power consumption due to the limited power supply and the small form factor of the planned end product. The design is optimized for two


use cases: active noise cancellation and a "hearing-aid" type of functionality for hearing enhancement. For battery-based operation, the power consumption of the designed processor should be lower than 1 mW.

3) The third case is a custom processor targeted at running various machine learning algorithms, optimized for floating-point calculations. It also uses the Transport Triggered Architecture (TTA) as the processor template and was designed using the TCE (TTA-based Co-Design Environment) tool chain. The functional units (FUs) of the processor are tailored towards implementing a wide variety of machine learning elementary operations, making it suitable for the learning and classification parts of Tasks 2.3 and 2.4.

 

2.1. Performance  improvements  

LTE Receiver

Compute capabilities can of course be designed for the best possible performance under all conditions. That, however, can easily lead to receiver overdesign and excessive energy usage in the computation units. It is quite straightforward to select the best algorithm that satisfies the minimal service requirements, for instance depending on channel conditions, modulation order, and user allocation. For a lower user allocation and a low modulation order, more complex algorithms can always be used to guarantee the lowest possible bit error rate. This is essential at least at the cell service boundary: in this situation the data rate and modulation order do not limit the use of compute capability for a more effective algorithm to improve the bit error rate. Closer to the serving base station, where the signal-to-noise ratio and the offered rate are high, simpler algorithms can be used, because the signal-to-noise ratio is not limiting the throughput but the available compute capability limits the use of more complex algorithms.

In wireless multiple-input multiple-output (MIMO) transmission over fading channels, maximum likelihood (ML) detection is desired to achieve low bit error rates. ML detection, however, involves an exhaustive search over all possible digitally modulated symbols, with complexity that grows exponentially with the rate. Therefore, in practical implementations ML detection is either approximated by an algorithm that limits the symbol search space, or completely replaced with linear detection, thus trading equalizer performance for complexity. For this study we have selected the two currently most attractive MIMO detection algorithms: the Minimum Mean Square Error (MMSE) equalizer and the Layered Orthogonal Lattice Detector (LORD), the first being representative of linear equalizers and the second a suboptimal ML equalizer with deterministic complexity (latency) and a soft-output generation complexity that is linear in the number of transmission antennas. Both algorithms are very practical to implement, as they can be parallelized in many dimensions to utilize instruction-, vector-, and thread-parallel hardware, leveraging either parallelism in the algorithm itself or the parallel, independent subcarriers of the OFDM transmission.

The processor core designed within the ALMARVI project, called LordCore, is based on the Transport Triggered Architecture paradigm and contains a 512-bit wide SIMD datapath for high-performance computation. The SIMD datapath can perform calculations on 32 16-bit wide half-precision floating-point values in parallel. The core also contains a 32-bit datapath for address and control calculations. The design is multicore-ready, including simplified synchronization hardware, and a test chip with a dual-core configuration is being fabricated. Figure 1 illustrates the architecture of a single core.
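For reference, the two detector families can be summarized in standard textbook notation (this formulation is the usual one from the MIMO literature, added here for clarity rather than quoted from the project documents). For a received vector y = Hs + n, with channel matrix H, symbol vector s drawn from a constellation of size Q on each of M layers, and noise variance sigma^2:

    \hat{s}_{\mathrm{ML}}   = \arg\min_{s \in \mathcal{Q}^{M}} \lVert y - Hs \rVert^{2}
                              \quad \text{(exhaustive search over } Q^{M} \text{ candidates)}

    \hat{s}_{\mathrm{MMSE}} = \left( H^{H} H + \sigma^{2} I \right)^{-1} H^{H} y

The Q^M term is the exponential growth referred to above; LORD avoids it by constraining the search layer by layer, which is why its soft-output complexity is linear in the number of transmit antennas.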
The architecture is scalable to other SIMD widths as well: 8- and 16-lane versions have also been developed for lower-performance usage such as LTE-M, and a 64-lane version is being considered.

Each core is connected to three memories: an instruction cache, a local scratchpad memory, and a global data memory. The instruction cache feeds instructions to the core. The cache is 128 bits wide, as the instructions are 128 bits wide, and has space for 1024 instructions, so the total size of the instruction cache per core is 16 KiB. The cache is direct-mapped with a cache line size of 32 instructions. There is no hardware-based coherence in the instruction caches, as they are read-only from the point of view of the core; an external invalidate signal suffices for reprogramming.

The local scratchpad memory is mapped to the OpenCL private and local memory spaces. This memory is 512 bits wide with a single data port, allowing either one 512-bit read or one 512-bit write per clock cycle. The memory has a capacity of 32 KiB, which was enough to store all the temporary data needed by the LORD and MMSE algorithms.

The global data memory is on-chip but outside the core, shared by all the cores of the chip and connected to them through a shared AXI bus. Global memory access is considerably slower than the local scratchpad memory. Figure 2 shows the system architecture of a chip containing two LordCores.
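The quoted cache geometry is internally consistent, as the following small C program checks (a sketch; all sizes are taken directly from the text above):

    #include <assert.h>

    enum {
        INSN_BYTES   = 128 / 8,                     /* 128-bit instructions     */
        ICACHE_INSNS = 1024,                        /* capacity in instructions */
        ICACHE_BYTES = ICACHE_INSNS * INSN_BYTES,   /* total cache size         */
        LINE_BYTES   = 32 * INSN_BYTES,             /* 32 instructions per line */
        ICACHE_LINES = ICACHE_BYTES / LINE_BYTES    /* direct-mapped lines      */
    };

    int main(void)
    {
        assert(ICACHE_BYTES == 16 * 1024);  /* 16 KiB per core, as stated */
        assert(LINE_BYTES   == 512);        /* 512-byte cache lines       */
        assert(ICACHE_LINES == 32);         /* 32 direct-mapped lines     */
        return 0;
    }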


     

 Figure 1 Architecture of a single core

 Figure 2 Multicore architecture

 

Table 1 Throughput (Mbit/s) of the LORD and MMSE algorithms running on the processor in different modes and with different core counts.

Algorithm  Layers  N_rx  Modulation  Single-core  Dual-core  Quad-core
LORD       2       2     QPSK         122.3        226.0      445.3
LORD       2       2     16-QAM       125.1        241.0      471.5
LORD       2       2     64-QAM        74.4        145.7      286.3
LORD       2       4     QPSK          84.9        161.1      314.3
LORD       2       4     16-QAM       103.4        199.6      391.0
LORD       2       4     64-QAM        69.3        135.9      267.1
MMSE       2       2     QPSK         217.9        380.1      703.1
MMSE       2       2     16-QAM       426.7        746.4     1382.0
MMSE       2       2     64-QAM       548.6        976.2     1849.7
MMSE       2       2     256-QAM      570.9       1042.9     1979.2
MMSE       4       4     QPSK          60.1        117.2      229.0
MMSE       4       4     16-QAM       119.3        232.4      454.1
MMSE       4       4     64-QAM       171.5        334.7      639.3
MMSE       4       4     256-QAM      209.5        409.3      784.3


Table 2 Comparison to other software MIMO detectors

System                   Algorithm  Layers  N_rx  Modulation  Throughput (Mbps)
Quadro FX 1700 [1]       LORD       2       2     16-QAM        16.8
Proposed (single-core)   LORD       2       2     16-QAM       125.1
GeForce 560 Ti [2]       1-way?     2       2     16-QAM       834.1
GeForce 560 Ti [3]       2-way?     2       2     16-QAM       402.2
Proposed (single-core)   MMSE       2       2     16-QAM       426.7
GeForce 560 Ti [2]       1-way?     2       2     64-QAM       183.5
GeForce 560 Ti [2]       2-way?     2       2     64-QAM        92.4
Proposed (single-core)   MMSE       2       2     64-QAM       548.6
Proposed (single-core)   LORD       2       2     64-QAM        74.4
Proposed (dual-core)     LORD       2       2     64-QAM       145.7
Proposed (quad-core)     LORD       2       2     64-QAM       286.3
Tesla C1060 [3]          MTT        4       4     QPSK         284.7
Proposed (quad-core)     MMSE       4       4     QPSK         229.0
Tesla C1060 [3]          MTT        4       4     16-QAM       120.0
GeForce 560 Ti [2]       1-way?     4       4     16-QAM       782.5
GeForce 560 Ti [2]       2-way?     4       4     16-QAM       386.1
Proposed (single-core)   MMSE       4       4     16-QAM       119.3
Proposed (quad-core)     MMSE       4       4     16-QAM       454.1
Tesla C1060 [3]          MTT        4       4     64-QAM        12.0
GeForce 560 Ti [2]       1-way?     4       4     64-QAM       230.7
GeForce 560 Ti [2]       2-way?     4       4     64-QAM       115.9
Proposed (single-core)   MMSE       4       4     64-QAM       171.5
Proposed (quad-core)     MMSE       4       4     64-QAM       639.3

Audio Signal Processing

In the second co-processor design we target audio signal processing in a wearable, always-on type of device. The processor implements audio signal processing algorithms such as IIR biquads, linear filters, spectral analysis, and adaptive filters. The input sample rate of such systems is quite modest compared to wideband radio transceivers, but the use case requires extremely low energy consumption. Audio processing is an inherent part of multimedia signal processing and cannot be neglected.

The architecture of the audio demonstrator setup is depicted in Figure 3 and the actual processor design in Figure 4. The main use cases for our demonstrations have been active noise cancellation and headphone transparency, which can only be achieved with a low processing latency. In Figure 3, the in-ear and out-ear microphones are sampled synchronously at a 48 kHz sample rate. Initially, a delay of 1/48000 s (about 20.8 µs) was set as the upper bound for the signal processing latency. For the synthesized processor we have a power budget of 1 mW.

The demonstrator works in real time on the Zynq platform. The audio co-processor is synthesized onto the FPGA. The ARM A9 processors in the Zynq platform control the DSP through shared memory and the memory-mapped registers of the custom audio DSP. The audio input as well as the A/D and D/A converters are seen by the DSP as stream I/O ports, and the main loop of the software executes at the rate at which data becomes available on the stream I/O ports.

The design depicted in Figure 4 is a very simple 4-bus TTA processor, with support for 32-bit integer and single-precision scalar arithmetic, and containing two functional units for two-way single-precision vector arithmetic. Its I/O unit handles the streaming input and output. With this design we have achieved a signal processing latency of 1/(8 * 48000) s (about 2.6 µs), thus exceeding our initial target by a factor of 8.


Figure 3 Audio Signal Processing Setup

 Figure 4. Audio Processor Design

Machine Learning TTA

Ten data mining algorithms (C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART) were identified for use in work package 2. Additionally, three more algorithms commonly used in practical data mining tasks were considered: logistic regression (classification), linear regression, and the Fast Fourier Transform (FFT), which is commonly used in signal processing.

Figure 5 lists elementary operations on the left-hand side; each column corresponds to one of the 13 data mining algorithms. A bullet sign indicates the presence of an elementary operation in that algorithm. The bottom line indicates to which data mining task an algorithm belongs: CLA - classification, REG - regression, C&R - classification and regression, CLU - clustering, ASS - association rule discovery, DSP - digital signal processing. We can see from the table that vector operations take priority: the first four operations receive a weighted score that is higher than the rest by a large margin. These should have priority for optimization at the hardware level, with application-specific components and memory addressing.

Figure 5 Analysis of elementary computations.



 

The processor is implemented with 5 transport busses divided into two groups. The high-level organization of the core and its FUs is presented in Figure 6 (left). The first group contains all integer units, the load-store unit, and other miscellaneous units. The second group contains all of the floating-point units. The integer units have three of the five transport busses reserved for them, while the floating-point units use the remaining two busses. The transport busses are not fully connected: by optimizing the number of connections to the functional units, significant power savings can be gained. The interconnect network has thus been heavily optimized for running machine learning applications rather than general computing tasks.

Additionally, the processor includes a Timing-Error Replacement (TER) FU for increasing variation robustness (ALMARVI Objective 4). The methodology here is based on having the system operate at a voltage and frequency point at which the timing of critical paths fails intermittently. These timing failures are detected by special latches and handled. Whether the target is power or energy savings, the detection and handling overhead has to be lower than the power savings resulting from the lower Vdd. Here, the error handling mechanism of the Timing-Error system is replacement: when an error is detected, the erroneous value is replaced with a predetermined safe value. The safe value is algorithm-specific. For example, in the category of iterative algorithms working on probabilities, essentially maximizing (or minimizing) a probability metric by iterative means, the probability value from the previous algorithm iteration can be used as the safe value.
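The replacement policy can be sketched in C as follows (a conceptual illustration of the mechanism described above; the names and data layout are hypothetical, and in the real design the replacement is performed by the TER hardware, not by software):

    typedef struct {
        float value;        /* result produced by the function unit       */
        int   timing_error; /* flag raised by the error-detecting latches */
    } ter_result_t;

    /* Iterative probability update with timing-error replacement: a value
       flagged as erroneous is replaced by the algorithm-specific safe
       value, here the probability from the previous iteration. */
    static float ter_select(ter_result_t raw, float prev_iter_value)
    {
        return raw.timing_error ? prev_iter_value : raw.value;
    }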



   

Figure 6 (left) Simplified high-level architecture view of the implemented TTA processor, showing the function units and the group of transport busses each connects to. (Right) Chip photograph of the processor.

 

2.2. Power/energy  efficiency  

LTE Receiver

A two-core version of the processor was synthesized with Synopsys Design Compiler version I-2013.12, using a 28 nm Fully Depleted Silicon On Insulator (FDSOI) process technology. Switching Activity Interchange Files (SAIF) were produced from Modelsim simulations and used for power estimation in Design Compiler. Only the active processing time of each test case was included in the SAIFs. The designs were synthesized with leakage and dynamic power optimization and clock gating enabled. The synthesis reported an achieved clock rate of 950 MHz at the nominal operating voltages and conditions. The estimated power consumption of the two-core version is 137 mW (MMSE) and 163 mW (LORD), well below our targeted 1 W power budget. The extrapolated power consumption of a four-core version is about 270 mW, which would deliver the targeted LTE device category 11 performance of 600 Mbit/s.

Audio Signal Processing

The audio processor was synthesized with Synopsys Design Compiler version I-2013.12, using a 28 nm Fully Depleted Silicon On Insulator (FDSOI) process technology. Switching Activity Interchange Files (SAIF) were produced from Modelsim simulations and used for power estimation in Design Compiler. Only the active processing time of each test case was included in the SAIFs. The designs were synthesized with leakage and dynamic power optimization and clock gating enabled. The synthesis reported an achieved target clock rate of 49.152 MHz (i.e., 1024 clock cycles per 48 kHz sample), leveraging the sub-threshold design methodology of the University of Turku. The estimated power consumption of the two-core version is 300 µW, which is well below our targeted 1 mW upper bound.

Machine Learning TTA

Table 3 presents detailed implementation information for the processor. The on-chip memory is foundry IP that was not optimized for low-voltage operation and is therefore situated in a separate voltage domain from the processor. To lower the dynamic power usage of the processor core, clock gating was also used. The processor is capable of operating at an average power of 110 µW when executing machine learning algorithms, with a minimum of 5.3 pJ/cycle and 1.8 nJ/iteration for Incremental Bayes.

Table 3 Implementation details of the processor

Process               28nm FDSOI CMOS
Core area             0.30 mm²
Operating voltage     Core 0.35 V, memory 1.0 V
Operating frequency   20.6 MHz
Instruction memory    2048 instructions (2048 x 84 bits)
Data memory           8 kB (2048 x 32 bits)
FP performance        30.9 MFLOPS (1.5 MFLOPS/MHz)
Bus configuration     5 busses, divided between integer FUs (3 busses) and floating-point FUs (2 busses)
Integer FUs           ALU, multiplier
Floating-point FUs    ALU with multiplier, divider, compare unit, sqrt, sigmoid
Other FUs             LSU, SPI IO unit, conversion unit (float-to-int, int-to-float), TER
Registers             2x 16x32-bit register files, 2x 8x32-bit register files
 

2.3. Scalability    

LTE Receiver

The architecture of the processor was designed to be scalable. The memory architecture allows scaling to multiple cores, and the SIMD width of the processor can easily be scaled. A width of 32 lanes was selected as the default because a relatively wide SIMD allows more work to be done per instruction bit, minimizing the energy used for instruction fetch, while still keeping the individual cores small enough to be easy to synthesize. Smaller single-core 8- or 16-lane versions could be used for low-throughput systems such as LTE-M, while multi-core 64-lane versions could extend performance for future communication standards. The scalability of the architecture is demonstrated by the throughput simulations of the designed architecture shown in Table 1.

Audio Signal Processing

The architecture of the processor was designed to be scalable, albeit within an extremely tight power and latency budget. The SIMD width of the processor can easily be scaled, as can the number of functional units. Initially the processor was designed to process the left and right channels separately. However, it was soon discovered that the left and right channels can be conveniently processed with 2-way vector units, handling the left and right channels in the lower and upper parts of the vector with no increase in code size; thus a 2-wide SIMD operation set was added to the design.

Machine Learning TTA

As the OS control described in D1.3 can also be implemented with the TER system, the processor is voltage-scalable from 0.35 V to 1 V, with an approximate operating frequency of 750 MHz at 1 V (the exact frequency cannot be measured due to IO restrictions).

2.4. Usability

The processor architecture was tailored using the TTA-based Co-design Environment (TCE) tools and its re-targetable OpenCL compiler [4], [5], based on the Transport Triggered Architecture (TTA) paradigm [6]. In transport-triggered processors, the datapath buses are exposed to the programmer: the processor is programmed by scheduling the data transfers that take place. Actual operations (e.g., arithmetic or memory operations) are executed when a transport is made to the specific "trigger port" of the function unit implementing the operation.
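As a toy illustration of this programming model (a C simulation sketch, not TCE code; real TTA hardware moves values over datapath buses rather than through function calls), an addition "happens" as a side effect of transporting an operand to the adder's trigger port:

    #include <stdio.h>

    typedef struct {
        int in1;     /* plain operand port: writing it has no side effect */
        int result;  /* output port, readable by later transports         */
    } add_fu_t;

    static void move_to_in1(add_fu_t *fu, int v)     { fu->in1 = v; }

    /* Transporting to the trigger port starts the operation. */
    static void move_to_trigger(add_fu_t *fu, int v) { fu->result = fu->in1 + v; }

    int main(void)
    {
        add_fu_t add = {0, 0};
        move_to_in1(&add, 2);      /* transport 1: operand -> add.in1     */
        move_to_trigger(&add, 3);  /* transport 2: operand -> add.trigger */
        /* The result port could now be transported directly to another
           FU input (software bypassing), skipping the register file.     */
        printf("%d\n", add.result);
        return 0;
    }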


Because high-level programmability of the designed processors is a priority, a key tool in TCE is its re-targetable software compiler, tcecc. The compiler uses the LLVM project as a backbone and pocl to provide OpenCL support. The frontend supports the ISO C99 standard with a few exceptions, most of the C++98 language constructs, and a subset of OpenCL C.

Compared to traditional "operation-programmed" Very Long Instruction Word (VLIW) architectures, where the instruction set specifies operations and data transfers occur as part of operations, the TTA programming model has the benefit that register file bypasses are explicitly programmed ("software bypassing"), and the operands of an operation do not all have to be read in the same clock cycle. Similarly, computed results do not have to be written to the destination register file on the same cycle they are produced, and the result write to a register can be omitted entirely if the result is bypassed directly to another operation. This allows using smaller register files with fewer read and write ports.

Because in TTA processors the register files and function units are fully decoupled from the rest of the architecture, thanks to the customizable interconnection network and the data transport programming model, it is easy to design new processors in a "component-based" manner. During the project it was found that the TTA paradigm works extremely well for wide SIMD datapaths. This is because SIMD instructions save instruction bits per operation, and thus instruction fetch power, typically a major pitfall of TTA and VLIW type processors, while the interconnection networks and simplified register files enabled by the TTA approach save power on the datapath side, where most of the power of streamlined-control-unit SIMD/VLIW processors is typically spent.

LTE Receiver

The algorithms were implemented in the OpenCL language. OpenCL allows using vector data types to execute the same code on many SIMD lanes of the processor, and also makes it easy to parallelize the workload over multiple cores. Each subcarrier executes in its own vector lane, and the algorithm is executed for 32 subcarriers in parallel per core. The OpenCL standard only supports vector data types up to 16 elements wide, but pocl was extended to support vector data types up to 32 elements wide.

The group of 32 subcarriers that executes on one core at a time forms one single-work-item OpenCL work-group. Multiple of these work-groups execute concurrently on the multiple cores of the processor. Pocl contains a simple work queue scheduler in which each thread gets a new work-group to execute after completing the previous one. The more advanced application-level command queue runtime reported in D4.4 was not utilized in this case, as the focus was on single-kernel performance. Changing the SIMD width for different versions of the processor requires relatively small changes to the program: only the SIMD data types need to be changed and the shuffle intrinsic calls modified.
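The per-subcarrier vectorization can be sketched as follows (assumptions: "half32" stands in for the 32-wide vector type enabled by the pocl extension described above, since standard OpenCL C defines vector widths only up to 16; half types require the cl_khr_fp16 extension):

    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    /* One single-work-item work-group processes 32 subcarriers at once,
       one subcarrier per lane of the 512-bit SIMD datapath. */
    __kernel void scale_subcarriers(__global const half32 *rx,
                                    __global const half32 *gain,
                                    __global half32 *out)
    {
        size_t g = get_group_id(0);  /* one work-group per block of 32 subcarriers */
        out[g] = rx[g] * gain[g];    /* 32 half-precision multiplies in one
                                        vector operation                           */
    }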

Audio Processor

The algorithms are implemented in the C language. C was selected due to the size of the application and the possibility of lower-level control than OpenCL offers. However, the same support for vector data types that is available in OpenCL C was used via a compiler extension available in the Clang compiler that was employed. This was done to obtain code that can explicitly utilize the SIMD units without using intrinsics or other less portable means.
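A minimal sketch of this approach (assuming Clang's ext_vector_type extension; the filter form and coefficient names are generic textbook choices, not the project's actual code) processes the left and right channels in the two lanes of one vector, matching the 2-way vector units described in the scalability section:

    /* OpenCL-style 2-wide vector type in plain C via a Clang extension. */
    typedef float float2 __attribute__((ext_vector_type(2)));

    /* One IIR biquad step in transposed direct form II;
       lane 0 carries the left channel, lane 1 the right. */
    static inline float2 biquad(float2 x, float2 *z1, float2 *z2,
                                float b0, float b1, float b2,
                                float a1, float a2)
    {
        float2 y = b0 * x + *z1;      /* scalars splat across both lanes */
        *z1 = b1 * x - a1 * y + *z2;
        *z2 = b2 * x - a2 * y;
        return y;                     /* both channels in one pass       */
    }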

Machine Learning TTA

The processor is programmed in C.


3. TUDelft  platform  

This section presents the quality aspects of the hardware configurations w.r.t. performance improvements, power/energy efficiency, scalability, and usability of the TUDelft execution platform.

The TUDelft platform used in ALMARVI is built around the rVEX reconfigurable VLIW processor. It is a VHDL implementation of the VEX architecture, in which many of the architectural parameters are implemented as design-time generics and some as parameters that can be changed at run-time. As such, the processor is design-time configurable and run-time parametrizable. The architecture itself was developed targeting media applications (image, video), and related architectures (i.e., the Lx/st200 series of processors by STMicroelectronics) have found widespread use in set-top boxes and related media devices.

In contrast to the commercial st200 series, the rVEX is a proof of concept showing the possibilities of dynamic (run-time) reconfiguration. It falls under the "Liquid Architectures" research theme of the TU Delft Computer Engineering Laboratory. The final aim of this theme is to develop and evaluate a platform that constantly adapts to the needs of the workload. This means that it must be able to provide high performance for single threads and high throughput for multiple threads (balancing Instruction Level Parallelism and Thread Level Parallelism, ILP and TLP). The rVEX platform supports this through the dynamic core in combination with a dynamic cache system.

In total, the rVEX platform consists of the synthesizable VHDL designs of the core, the cache, and a number of peripherals. This system can be used either standalone or in conjunction with the GRLIB library to create a SoC with a DDR controller and various other peripherals. On the software side, there is an interface tool that can connect to the core and provides the user with extensive debugging capabilities and full control of the processor. There are a number of different compilers that can target the rVEX, a port of binutils and GDB, basic Linux support (uCLinux with a NOMMU kernel), runtime libraries (the uClibc C standard library and newlib), and an architectural simulator. These components are explained in more detail in the usability section. The next sections discuss a number of improvements we have developed in the architecture and how they have impacted the performance of the platform.

3.1. Performance  improvements  

Traditionally, code size has been a drawback of the VLIW design philosophy. As many of the techniques used to increase ILP (e.g., loop unrolling) increase the code size, this metric usually does not compare favorably for a VLIW in relation to RISC machines. The need for horizontal NOPs (no-operations that fill unused issue slots when there is not enough ILP) increases this difference even more. The result is that VLIW processors usually require larger caches and more memory bandwidth to perform well.

By implementing a new VLIW instruction encoding, the performance of the rVEX processor has been increased by up to a factor of three while maintaining compatibility with the processor's dynamic parametrizability, as published in [4]. This has been achieved by removing the NOPs from the binary, which dramatically increases the effectiveness of the instruction caches (as can be seen in Figure 7). Here, the speedup is depicted when comparing processors with the new instruction encoding (stopbit) to the old instruction encoding (baseline). The right figure shows the differences in miss rates that cause the improvement. There are results for both the static (design-time reconfigurable) and dynamic (run-time reconfigurable) versions of the processor. The Powerstone embedded benchmark set was used for the evaluations. The dots show results for each individual benchmark; the lines represent the average for the entire set. The average speedup is highest when using an instruction cache size of 4 KiB (a factor of 3). The speedup decreases for larger cache sizes, but this is due to the small size of the benchmarks (the code sections of many benchmarks easily fit in the 32 KiB cache regardless of the used encoding).
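The idea behind the encoding can be illustrated with a short C sketch (the bit position and syllable width here are hypothetical, chosen for illustration; the actual rVEX format is defined in [4]): instead of padding every bundle to the full issue width with NOPs, each syllable carries a stop bit, and a bundle simply ends at the first syllable whose stop bit is set.

    #include <stdint.h>
    #include <stddef.h>

    #define STOP_BIT (1u << 31)  /* hypothetical stop-bit position */

    /* Returns the number of 32-bit syllables in the bundle starting at
       index pc: fetch continues until a stop bit terminates the bundle
       or the maximum issue width is reached. No padding NOPs are stored. */
    static size_t bundle_length(const uint32_t *imem, size_t pc,
                                size_t max_issue)
    {
        size_t n = 0;
        do {
            n++;
        } while (n < max_issue && !(imem[pc + n - 1] & STOP_BIT));
        return n;
    }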


   

Figure 7: Left, Speedups of the static and dynamic versions of the rVEX processor using the new instruction encoding. Right, Cache miss rate comparison of the old and new instruction encodings

The most important aspect of this improvement is that it leverages components in the processor that were added to support dynamic reconfigurability. In other words, the cost of adding this instruction encoding is lower for our dynamic VLIW than for normal VLIWs. The result is that, while performance increases, the difference in area and power utilization between a static VLIW and the rVEX (which started out quite considerable) is reduced, as will be shown in the next section. These improvements are important for ALMARVI because they mean that the size of the instruction memory needed to achieve a certain level of performance can be reduced greatly, which amounts to a large difference when scaling up the number of cores in a platform.

3.2. Power/energy efficiency

As the memory subsystem consumes a substantial fraction of a typical system's energy, increasing the effectiveness of the caches also reduces energy utilization, as can be seen in Figure 8. The figure depicts the energy utilization of the rVEX core with caches and main memory running the same (Powerstone) benchmark set.

Figure 8: Energy utilization of the rVEX core, caches, and main memory for the Powerstone benchmark set

The most interesting result is that the difference in energy utilization between the static and dynamic versions has been decreased, as can be seen in Figure 9. This is because we were able to use a number of components that are needed for dynamic reconfigurability to support the sparse encoding scheme.

●●

●●●●

●●

●●

●●

●●●●

●●

●●●●

●● ●

●●●●

●●

●●●●

●●

●●●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●

●●●●

●●

●●

●●●●●●●●●●●●●●

●●●●

●●

●●●●

●●●●●●●●●●●●●●

●●●●

●●

●●●●1

2

4

8

16

1 2 4 8 16 32Cache Size (KiB)

Spee

dup

● ●Dynamic Core Static Core

Fig. 5. Speedup for stop-bit implementation for different instruction cachesizes. The lines represent the average speedup for a particular cache size.

with sparse instruction encoding than without. This is becauseat those cache sizes the entire application fits in the cache,and the reduction in cache misses is offset by the penaltyof having a longer branch delay. This could be remedied byinserting alignment NOPs to ensure that branch targets arealways aligned, at the cost of an increase in cache misses.

In the same figure we also observe that for very smallinstruction cache sizes, the speedup is not as significant asit is for intermediate sizes. This is caused by the fact that thereduction in cache misses for intermediate cache sizes is farlarger than for small cache sizes, as seen in Fig. 4.

Fig. 6 shows the normalized execution times for the dy-namic core. The baseline is the average of the worst executiontime of each application individually, executing on the dy-namic baseline design with a cache size of 1KiB. This figureshows, for instance, that for smaller cache sizes, the dynamicstop-bit implementation performs equivalent to the dynamicversion without stop-bit with a cache between 2 and 4 timeslarger.

D. Energy Results

Fig. 7 presents the total energy consumed by each ofthe benchmarks. The lines show the geometric mean of allapplications at each cache size. We can see that for smallcache sizes, due to the additional hardware required to supportreconfiguration, the dynamic core consumes more energy thanthe static core. However, at large cache sizes the differentdesigns are closer together in terms of energy consumption.

Fig. 8 depicts the energy consumption of the dynamiccore relative to that of the static core (values greater than1 mean that the static version consumes less energy than thedynamic one). We can see that the baseline dynamic designconsumes far more energy at small cache sizes, whereas when

Fig. 6. Normalized execution times for the dynamic core

Fig. 7. Energy consumption for each of the benchmarks at different cachesizes.

using sparse instruction encoding the designs consume similaramounts of energy.

As one can observe, the huge difference in energy consump-tion between the static and dynamic versions is significantlydecreased when using the proposed stop-bit approach. Mostnotably, both processors consume approximately the sameamount of energy at larger cache sizes. It means that one cantake advantage of all the adaptability that the dynamic versionprovides, with limited additional costs in terms of energy.

V. CONCLUSION

In this paper, we extended the stop-bit technique for sparseinstruction encoding to a dynamically reconfigurable VLIWprocessor. We showed that, by implementing this technique,

TABLE IITHE RESOURCE USAGE ON THE FPGA FOR THE DYNAMIC CORE WITH

AND WITHOUT STOP-BIT IMPLEMENTATION.

Resource Original Stop-bit Increase

Registers 30153 30537 1.3%Luts 61927 62379 0.7%BRAMs 125 125 0.0%

IV. RESULTS

We evaluated four different versions of the processors: staticbaseline, static with stop-bit, dynamic baseline, and dynamicwith stop-bit — all of them in their 4-issue configurations.We use the 4-issue configurations to provide a fair comparisonbetween the static and dynamic cores. The difference betweendynamic and static versions is that the binaries for the formerare compiled as generic binaries. We considered instructioncache sizes ranging from 1KiB to 32KiB. These sizes werechosen so that at the largest cache size each of the programsfits in the instruction cache entirely.

The designs are implemented in VHDL and prototyped ona Xilinx Virtex 6 FPGA (ML605 Development board). Withthese prototypes, we use performance counters to determinethe number of cache accesses, misses, and the number ofrunning cycles. The cache stall time is 16 cycles per 4-bytebus access. We use the Cadence Encounter RTL Compilerto obtain power dissipation in ASIC (Application SpecificIntegrated Circuit), using a 65nm CMOS cell library fromSTMicroeletronics. The energy consumption of the memorysubsystem was calculated with the Cacti Tool [18].

We use applications from the Powerstone benchmarks [19].All sources are compiled with the HP VEX compiler [20]and assembled with either the ⇢-VEX port of GNU as, or ourmodified version of the st200 assembler. The dynamic stop-bit versions are assembled with alignment turned off, so thatinstruction bundles are not padded at all. Since the processorlacks floating point operations, we use the floatlib libraryincluded with the HP VEX compiler (based on BerkeleySoftFloat [21]).

A. FPGA Resource usageTable II shows the resource usage of the dynamic core on the

FPGA. It shows that the increase is only 1.3% for the numberof registers and 0.7% for the number of lookup tables. As wewill show in the following sections, with this small increasein area we achieve significant improvements in performance,energy, and code size.

B. Code Size Reduction and Instruction Cache Miss RateIn Table III, we show the reduction in code size for each of

the 14 benchmarks used. We can see that the average reductionis around 50%. The reductions for the dynamic core in 8-wayconfiguration are included for reference, and are even moreextreme. These reductions will impact the cache behavior. InFig. 4, we show the cache miss rates for the two differentcores with and without sparse instruction encoding. The results

TABLE IIITHE CODE SIZE REDUCTION FOR EACH OF THE BENCHMARKS.

Program code size reductionstatic dynamic dynamic

4-way core 4-way core 8-way core

adpcm 49% 48% 73%bcnt 35% 38% 64%blit 47% 45% 67%

compress 53% 51% 74%crc 48% 48% 71%des 42% 44% 68%

engine 57% 54% 77%fir 60% 54% 76%

g3fax 58% 55% 76%jpeg 53% 51% 73%

pocsag 55% 51% 74%qurt 67% 65% 82%

ucbqsort 57% 54% 76%v42 56% 53% 75%

average 53% 51% 73%

[Plot for Fig. 4: instruction cache miss rate (%), logarithmic scale from 0.02 to 64.00, versus cache size (1, 2, 4, 8, 16, 32 KiB); series: Dynamic Baseline, Static Baseline, Dynamic Stop-bit, Static Stop-bit.]

Fig. 4. Cache miss percentage for the dynamic and static cores, both with and without sparse instruction encoding, for different instruction cache sizes. The dots represent the individual benchmarks, whereas the lines represent the average miss percentage for a particular configuration.

In Fig. 4, we show the cache miss rates for the two different cores with and without sparse instruction encoding. The results show that both designs achieve a similar reduction in cache misses. In fact, with sparse instruction encoding the miss rates are similar to those of the canonical encoding with a cache almost four times as large. This might seem like a larger improvement than expected, since the code size was only reduced by half. However, because loops account for a majority of the executed instructions, a code size reduction that allows an entire loop body to fit into the cache will have a disproportionate impact on the cache miss rate.

C. Execution Time

Fig. 5 shows the speedup in execution time achieved for both the dynamic and static cores. We can see that for larger cache sizes, the execution time of some benchmarks is larger with sparse instruction encoding than without.

Fig. 5. Speedup for the stop-bit implementation for different instruction cache sizes. The lines represent the average speedup for a particular cache size.

This is because at those cache sizes the entire application fits in the cache, and the reduction in cache misses is offset by the penalty of having a longer branch delay. This could be remedied by inserting alignment NOPs to ensure that branch targets are always aligned, at the cost of an increase in cache misses.

In the same figure we also observe that for very small instruction cache sizes, the speedup is not as significant as it is for intermediate sizes. This is caused by the fact that the reduction in cache misses for intermediate cache sizes is far larger than for small cache sizes, as seen in Fig. 4.

Fig. 6 shows the normalized execution times for the dynamic core. The baseline is the average of the worst execution time of each application individually, executing on the dynamic baseline design with a cache size of 1 KiB. This figure shows, for instance, that for smaller cache sizes, the dynamic stop-bit implementation performs equivalently to the dynamic version without stop-bit with a cache between 2 and 4 times larger.

D. Energy Results

Fig. 7 presents the total energy consumed by each of the benchmarks. The lines show the geometric mean of all applications at each cache size. We can see that for small cache sizes, due to the additional hardware required to support reconfiguration, the dynamic core consumes more energy than the static core. However, at large cache sizes the different designs are closer together in terms of energy consumption.

Fig. 8 depicts the energy consumption of the dynamic core relative to that of the static core (values greater than 1 mean that the static version consumes less energy than the dynamic one). We can see that the baseline dynamic design consumes far more energy at small cache sizes, whereas when using sparse instruction encoding the designs consume similar amounts of energy.

Fig. 6. Normalized execution times for the dynamic core.

Fig. 7. Energy consumption for each of the benchmarks at different cache sizes.


As one can observe, the huge difference in energy consumption between the static and dynamic versions is significantly decreased when using the proposed stop-bit approach. Most notably, both processors consume approximately the same amount of energy at larger cache sizes. It means that one can take advantage of all the adaptability that the dynamic version provides, with limited additional costs in terms of energy.

V. CONCLUSION

In this paper, we extended the stop-bit technique for sparse instruction encoding to a dynamically reconfigurable VLIW processor. We showed that, by implementing this technique, significant improvements in performance, energy consumption, and code size can be achieved at an FPGA resource increase of only around 1%.


In short, each pair of datapaths (lane pair) of the dynamic core is able to function as a full separate core to support dynamic reconfiguration. Because of this, each lane pair is able to execute the full instruction set, in contrast to the static core, where each datapath is specialized. For this reason, in the case of the static core, instructions need to be forwarded to a datapath that is able to execute them. This requires additional dispatch circuitry that increases energy utilization (this is why the line that depicts the energy utilization of the static stop-bit design increases at larger cache sizes, crossing the other lines). In the dynamic core, this instruction dispersal is not necessary because each lane can execute each instruction. The locations of the functional units within lane pairs are coordinated with the assembler. The result is that the overhead of adding dynamic reconfigurability to the rVEX is reduced, as can be seen in Figure 9.

Figure 9: Difference in energy utilization between the dynamic and static rVEX cores. Without the new instruction encoding scheme, the dynamic core consumed up to a factor of 3.5 more power than a static core. By reusing some of the additional logic that is needed for dynamic reconfigurability for the new encoding scheme, this difference has been reduced.

3.3. Scalability

When evaluating the scalability of the platform, two factors need to be taken into consideration.

Firstly, the rVEX is a proof of concept that has not been taped out yet; a project in this direction is in its earliest stages. Therefore, the most appropriate area utilization results come from FPGA synthesis tools. ASIC synthesis tools have been used to generate estimations (these have been used to calculate the energy utilization figures), but until the chip has been taped out and verified working, these numbers will remain estimates.

Secondly, the rVEX processor is both run-time parametrizable and design-time configurable (using VHDL generics). Therefore, scalability is a metric in the design space when choosing the right parameters for the application. When creating a general-purpose platform that must be able to provide high performance for single threads and high throughput for multiple threads, the full dynamic 8-issue core can be used. However, this version will not provide a large degree of scalability (approximately 64 datapaths can fit on a Xilinx VC707 FPGA development board). On the other hand, if the workload is highly parallelizable and scalability is an important requirement, a static 2- or 4-issue core can be used. The area utilization is considerably lower, and single-thread performance can be sacrificed because the workload will rely on multithreading to achieve high performance. In this case, the reduction in area results in improved scalability (approximately 100 datapaths can fit on the same FPGA). In both cases, the memory hierarchy is not considered, as any processor's scalability will be impeded by the ability of the memory to provide bandwidth to increasing numbers of cores.

One of the short-term goals for 2016 at TU Delft is to design a platform where the memory structure can be configured at design time to match the memory access patterns of some of the ALMARVI image processing algorithms. This structure will include local memories whenever possible, so cores can stream data between computation stages without needing to access a shared bus. The expectation is that this memory structure will improve scalability considerably.


3.4. Usability

On a usability level, the rVEX platform has matured immensely since the start of the ALMARVI project. This section gives an overview of the different aspects.

Hardware

In 2015, the core was redesigned top-down with the following requirements (most of which were included with usability in mind):

• Precise trapping and interrupts
• Advanced debugging hardware and software
• Hardware tracing functionality and performance counters
• Core structure and pipeline organization (easily) configurable at design time
• (Easy) design-time extensibility (through instruction set extensions)
• Run-time parametrizable core (number of execution lanes)
• Dynamic cache that supports the varying number of execution lanes of the core

The design supports the ML605 and VC707 FPGA boards and can be synthesized using ISE or Vivado. There are versions with and without bus and peripherals (using the GRLIB VHDL library), with and without caches, and static and dynamic cores. All of these options are easily configurable at design time.

The peripherals from GRLIB that are used in the platform are the interrupt controller, timer, framebuffer, and DDR memory controller. The core is able to use either UART or the newly created PCI Express interface to connect to a host machine. This is depicted in Figure 10, which shows the system components including the additions that were necessary to support OpenCL via PCIe.

 Figure 10: Components added to the rVEX platform to support the PCI express interface

 

Interface, Debug support

Both interfaces are supported by a tool that is able to access the core for various debug purposes and advanced control of the core. The tool can access memory, the full state of the core (including the general-purpose register file and a wide array of control registers such as the program counter), and all of the debug functionality.


The platform supports many standard debugging features such as breakpoints/watchpoints, stepping, and register and memory readouts. Additionally, a program can be traced by hardware, where all relevant information of every execution cycle is reported to the host machine for analysis. These traces can be annotated with the disassembly of the program to monitor the full execution (instruction fetch, register reads, result writeback, cache hits/misses, etc.). These traces can, for example, be compared to the trace output of an architectural simulator. Lastly, the rVEX interface tool supports connections from GDB. We have added the rVEX architecture to GDB; it is available in our binutils-gdb port.
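As an illustration of the GDB route, a debug session could look as follows. This transcript is hypothetical: the executable name of the GDB build and the remote port are assumptions made for illustration, and only standard GDB remote-debugging commands are shown.

    $ rvex-gdb program.elf                  (hypothetical name of the GDB build)
    (gdb) target remote localhost:21079     (port exposed by the rVEX interface tool; assumed)
    (gdb) break main
    (gdb) continue
    (gdb) info registers
    (gdb) stepi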

Compilers, runtime, libraries

There are multiple compilers available for the rVEX platform. The standard compilers are HP VEX (a closed-source descendant of the Multiflow compiler), GCC, Open64 (open source, ported to st200 by STMicroelectronics, modified for VEX), LLVM, and CoSy.

There are currently two possible choices for the run-time environment (besides running bare-metal): uClinux with uClibc, and newlib with a compile-time generated filesystem. Newlib can be used to run large programs, such as the SPEC benchmark suite, as long as the input and output files and their sizes are known at compile time. uClinux (Linux for microcontrollers, a distribution of Linux with a no-MMU kernel) has a filesystem size limitation of 4 MiB because we are using a RAM disk.

Currently supported libraries are a basic math library included in newlib and a floating-point library. The latter has a decent performance of 23 cycles for a floating-point multiplication on a 4-issue rVEX core. However, the rVEX does not target floating-point workloads, and the image processing algorithms used within the context of the ALMARVI project will all be converted to fixed point before targeting the rVEX platform.
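To make the fixed-point route concrete, the fragment below shows one common way to replace fractional floating-point arithmetic on such a core, here using the Q15 format. This is a generic, minimal sketch rather than ALMARVI project code; the chosen format and the pixel-scaling example are assumptions made for illustration.

    #include <stdint.h>

    /* Q15 fixed point: a value x is stored as round(x * 2^15). */
    typedef int16_t q15_t;

    #define Q15_ONE (1 << 15)

    static inline q15_t q15_from_float(float x) {
        return (q15_t)(x * Q15_ONE + (x >= 0 ? 0.5f : -0.5f));
    }

    /* Multiply two Q15 numbers: the 32-bit product is in Q30 format,
     * so shift right by 15 to return to Q15. */
    static inline q15_t q15_mul(q15_t a, q15_t b) {
        return (q15_t)(((int32_t)a * (int32_t)b) >> 15);
    }

    /* Example: scale an 8-bit pixel by a fractional gain (e.g. 0.7)
     * using integer operations only, with saturation to 0..255. */
    static inline uint8_t scale_pixel(uint8_t p, q15_t gain) {
        int32_t r = ((int32_t)p * gain) >> 15;
        return (uint8_t)(r > 255 ? 255 : (r < 0 ? 0 : r));
    }

With this style of code, a multiplication costs a single integer multiply and shift instead of the 23-cycle floating-point library call mentioned above.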

OpenCL Support

Figure 11 shows a graphical representation of the software stack that supports OpenCL on the rVEX platform using pocl. The rVEX device layer has been added to the pocl project. It uses the newly developed rvex and xdma drivers to connect to the hardware. This setup can also be used on the ZYNQ platform (using AXI instead of PCIe), so it can be included in the demonstrator setup.

Figure 11: OpenCL support on the rVEX platform
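For reference, a minimal host-side program for this stack is sketched below. It uses only standard OpenCL 1.x API calls; the kernel source, buffer size, and work size are placeholders, and nothing rVEX-specific is assumed beyond the device being reachable through the normal OpenCL platform query (as pocl provides).

    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void scale(__global int *buf) {"
        "  size_t i = get_global_id(0);"
        "  buf[i] *= 2;"
        "}";

    int main(void) {
        cl_platform_id plat; cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        int data[256];
        for (int i = 0; i < 256; i++) data[i] = i;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

        size_t n = 256;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
        printf("data[1] = %d\n", data[1]);  /* prints 2 if the kernel ran */
        return 0;
    }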


4. UTIA platform

Performance improvements and scalability of a Full HD video processing algorithm will be presented on a motion detection algorithm.

The device works with a Full HD color video sensor and stores two subsequent video frames in memory. The algorithm performs edge detection on both frames using two Sobel filters. It computes the difference of the outputs of both filters to detect the moving edges. These moving edges are filtered by a median filter to remove noise. The filtered edges are marked in red and displayed together with the original content of the video frame on the Full HD HDMI display.

The algorithm is designed, debugged, and tested first on an ARM Cortex-A9 processor (666 MHz). The processor is capable of computing only about one frame per second, with maximal (-O3) optimization and the YCbCr format (16 bits per pixel) for the representation of data. The application requires the moving edge detection to be computed at the Full HD video frame rate of 60 FPS. This indicates the need to:

• Accelerate the computation 60 times
• Scale the computation over multiple boards to reach this requirement (60 FPS)
• Reach (if possible) the required 60 FPS with improved quality (the higher-precision RGB format, 24 bits per pixel)
• Reuse the software-defined (C/C++) description of the algorithm from the initial working (but slow) ARM Cortex-A9 implementation

To reach this goal we build on the concepts and tool chains for automatic generation of HW accelerators described in deliverable D3.2 for the UTIA platform. The tool chain used is briefly summarized now.

Generation of accelerators is based on automatic compilation of C/C++ functions by the Xilinx high-level synthesis (HLS) compiler 2015.4 into IP cores for the programmable logic of ZYNQ devices. The Xilinx SDSoC 2015.4 environment complements the HLS-generated IP cores with data movers serving for DMA transport of data from DDR3 to the programmable logic. This compilation can be done in the Xilinx SDSoC 2015.4 environment as described in D3.2 for the UTIA platform.
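To connect the algorithm description above to the C/C++ form that the HLS flow consumes, a simplified per-pixel formulation is sketched below. It is illustrative only: the function names, the 3x3 windows, and the use of a threshold are assumptions, not the UTIA production sources, and the median filtering stage is omitted.

    #include <stdint.h>
    #include <stdlib.h>

    /* Sobel gradient magnitude (|Gx| + |Gy|) of a 3x3 luma window. */
    static int sobel3x3(const uint8_t w[3][3]) {
        int gx = (w[0][2] + 2*w[1][2] + w[2][2]) - (w[0][0] + 2*w[1][0] + w[2][0]);
        int gy = (w[2][0] + 2*w[2][1] + w[2][2]) - (w[0][0] + 2*w[0][1] + w[0][2]);
        return abs(gx) + abs(gy);
    }

    /* One output pixel of the motion detector: the edge images of the
     * current and previous frame are compared, and a large difference
     * marks a moving edge (to be drawn in red in the output frame). */
    static int moving_edge(const uint8_t cur[3][3], const uint8_t prev[3][3],
                           int threshold) {
        return abs(sobel3x3(cur) - sobel3x3(prev)) > threshold;
    }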

4.1. Performance improvements

In the case of the Full HD motion detection algorithm we have to accelerate the 666 MHz ARM processor approximately 60 times using the programmable logic part (PL) of the Xilinx ZYNQ device. The PL can implement IP cores clocked at 150 MHz. It is clear that the implementation has to rely on the parallel processing capabilities of the HW and also on chaining of accelerators to achieve parallel, pipelined computation.

The Xilinx SDSoC (LLVM-based) compiler performs a source code transformation replacing pointer arguments of functions with new interfaces controlling the auto-generated data mover IPs. The data movers control the auto-generated HW DMA engines connected to the high-performance ports of the ARM Cortex-A9 processing subsystem of the ZYNQ device. The DMA access to the DDR3 is supported by the multiport DDR3 controller of the ZYNQ device.
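As a sketch of what this looks like at the source level, a function selected for acceleration can be annotated as below. The pragma follows the SDSoC style for declaring sequential (streaming) access on array arguments, but the function itself and the exact pragma set used in the UTIA design are assumptions made for illustration.

    #include <stdint.h>

    #define PIXELS (1920 * 1080)

    /* Candidate for HW acceleration: SDSoC replaces the array arguments
     * with interfaces driving auto-generated data movers (DMA engines). */
    #pragma SDS data access_pattern(in_frame:SEQUENTIAL, out_frame:SEQUENTIAL)
    void filter_pass(const uint16_t in_frame[PIXELS], uint16_t out_frame[PIXELS])
    {
        for (int i = 0; i < PIXELS; i++) {
            /* placeholder body; the real cores compute Sobel, difference,
             * and median stages over line buffers */
            out_frame[i] = in_frame[i];
        }
    }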


Figure 12: Concept of generating accelerated HW in the SDSoC environment for the Almarvi platform

Figure 12 explains how the Xilinx SDSoC 2015.4 HW platform is used in Almarvi. It is reproduced from the D3.2 deliverable for easier access here. The I/O interface IPs have to be prepared in advance by an HDL HW designer for the concrete HW system. In the case of the Almarvi Full HD video image sensor platforms, the "Interface IPs" transport data from the input Full HD image sensor to the input DDR3 video frame buffer, and also transport data from the output DDR3 video frame buffer to the Full HD HDMI output. See Figure 13. Details about the export of this platform to the Xilinx SDSoC environment are described in D3.2.

Figure 13: Almarvi vita-hdmio platform for the SDSoC environment

The platform described in Figure 13 decouples the Cortex-A9 ARM computation from the demanding Full HD resolution video data stream I/O. Software debugging can be done on the ARM while the real-time video input and output streams are processed by the vita-hdmio platform HW (see Figure 13). Figure 14 describes the motion detection algorithm as introduced in the initial section of this chapter.


Figure 14: Motion detection algorithm highlighting the moving edges in the Full HD video stream coming from the video sensor

4.2. Power/energy efficiency

Figure 15: Accelerated motion detection algorithm implemented by UTIA

Figure 15 presents the running edge detection algorithm implemented first on the ARM and next on the SDSoC-generated accelerator chain presented in Figure 14.

ARM (666 MHz), see top SW path in Figure 14: 1.17 FPS


SDSoC-generated accelerators (150 MHz), see HW path in Figure 14: 36.95 FPS. This is an acceleration of approximately 31 times (36.95 / 1.17 ≈ 31.6). The energy needed by the complete board to compute a Full HD motion detection frame is reduced 30 times. This is still not sufficient for the 60 FPS requirement.

4.3. Scalability

We have performed additional design exploration in the Xilinx SDSoC 2015.4 environment. We have found that the ZYNQ PL part can accommodate two parallel copies of the HW chain presented in Figure 14.

Two parallel chains of SDSoC-generated accelerators (150 MHz) deliver: 57.09 FPS. This is an acceleration of approximately 48 times. The energy needed by the complete board to compute a Full HD motion detection frame is reduced 45 times. This is still not sufficient for the 60 FPS requirement. The utilization of PL slices is close to 100%, and this holds for the 16 bits per pixel YCrCb data representation as defined by the Almarvi vita-hdmio platform shown in Figure 13. A solution with improved precision and the RGB 24 bits per pixel data representation would not fit in the device in the case of two parallel chains of SDSoC-generated accelerators (150 MHz).

To reach the required 60 FPS we have to scale up the computation to two boards. To reach this goal, we have created the Almarvi hdmii-hdmio platform for the second board. See Figure 16.

Figure 16: Almarvi hdmii-hdmio platform for the SDSoC environment, enabling serial scaling of the computation over a chain of boards communicating via the Full HD HDMI standard

We have also created two new Almarvi platforms, vita-rgb-hdmio (Figure 17) and hdmii-rgb-hdmio (Figure 18), supporting the RGB 24 bits per pixel format for the video data.


Figure 17: Almarvi vita-rgb-hdmio platform with extended RGB 24-bit precision

Figure 18: Almarvi hdmii-rgb-hdmio platform with extended RGB 24-bit precision for serial scaling of the computation over a chain of boards communicating via the Full HD HDMI standard

The motion detection algorithm is slightly modified for each board:

• The first board, with the video sensor, computes the upper 50% of each frame and copies the unmodified lower 50% of each frame. See Figure 19.
• The first board is capable of computing its upper 50% of each frame at the required 60 FPS.
• The second board receives Full HD frames from the first board at 60 FPS, with the upper 50% of each frame already done. The board copies this upper region without modification to the output frame. The board computes the lower 50% of each frame, also at the required 60 FPS. See Figure 20. (A sketch of this row-range split is given below.)
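A minimal sketch of this row split, assuming a hypothetical function process_rows() that runs the motion detection over a half-open row range (all names and the 16-bit pixel type are illustrative):

    #include <stdint.h>
    #include <string.h>

    #define W 1920
    #define H 1080

    /* Hypothetical board-specific worker, e.g. the accelerated detector. */
    void process_rows(const uint16_t *in, uint16_t *out, int begin, int end);

    /* Per-board frame handler: rows [row_begin, row_end) are processed,
     * the remaining rows are copied through unmodified. */
    void handle_frame(const uint16_t *in, uint16_t *out,
                      int row_begin, int row_end)
    {
        memcpy(out, in, (size_t)row_begin * W * sizeof(uint16_t));
        memcpy(out + (size_t)row_end * W, in + (size_t)row_end * W,
               (size_t)(H - row_end) * W * sizeof(uint16_t));
        process_rows(in, out, row_begin, row_end);
    }

The first board would call handle_frame(in, out, 0, H/2) and the second board handle_frame(in, out, H/2, H).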

 


Figure 19: Motion detection algorithm scaled for serially connected boards. Algorithm in the first (video sensor) board with the Almarvi vita-hdmio platform or the vita-rgb-hdmio platform (with extended precision)

Figure 20: Motion detection algorithm scaled for serially connected boards. Algorithm in the second board with the Almarvi hdmii-hdmio platform or the hdmii-rgb-hdmio platform (with extended precision)

4.4. Usability

We now have a scalable system which can reasonably be used. See Figure 21.

• It is a 2-board system connected serially via the Full HD HDMI standard.
• The scaled-up 2-board system has met the 60 FPS requirement for the Full HD motion detection algorithm.
• The boards have sufficient PL area reserve for both data formats, the compact YCrCb (16 bits per pixel) as well as the RGB (24 bits per pixel) data representation.
• The scaled system accelerates 60 times, as required.
• The energy needed by the scaled 2-board solution to compute a Full HD motion detection frame is reduced 30 times in comparison to the single-board SW solution on the ARM.
• The power consumption and cost of the scaled-up 2-board system have increased 2x in comparison to the single-board solution (single board 6 W; scaled-up 2-board system 12 W).
• The glass-to-glass latency (from the video sensor to the display) has increased 2x due to the pipelined processing.
• The scaled-up system remains manageable and can be supported by the automatic generation of accelerators as described in the Almarvi deliverable D3.2.


Figure 21: Motion detection algorithm scaled from a single board to two serially chained boards, meeting the Full HD resolution requirement and the 60 frames per second performance.

The flower movement is detected by the scaled-up algorithm working on the two serially connected ZYNQ boards. It is marked in the figure as the red edges of the moving parts of the flower. See Figure 21.

The following packages have been prepared by UTIA for the Xilinx SDSoC 2015.4 environment and the ZYNQ te0720-02-2IF system on module on the te701-05 carrier board (see Figure 21):

• vita-hdmio, with support for the Full HD (60 FPS) color Vita 2000 video sensor and Full HD (60 FPS) HDMI output, with internal video data representation as YCrCb (16 bits per pixel). It supports the Imageon FMC board.
• hdmii-hdmio, with support for Full HD (60 FPS) HDMI input, with internal video data representation as YCrCb (16 bits per pixel). It supports the Imageon FMC board.
• vita-rgb-hdmio, with support for the Full HD (60 FPS) color Vita 2000 video sensor and Full HD (60 FPS) HDMI output, with internal video data representation as RGB (24 bits per pixel). It supports the Imageon FMC board for the video sensor, the Digilent FMC board for HDMI input, and the te701-05 carrier HDMI output.
• hdmii-rgb-hdmio, with support for Full HD (60 FPS) HDMI input, with internal video data representation as RGB (24 bits per pixel). It supports the Digilent FMC board for HDMI input and the te701-05 carrier HDMI output.

The derived solution is usable and scalable with reasonable additional effort. See Figure 21.


5. Conclusions  

The ALMARVI execution platforms cover a wide spectrum of flexibility and customizability for the ALMARVI applications. This deliverable described the various quality aspects of the three ALMARVI-specific hardware platform configurations developed by Nokia, TUT and UTURKU; by TUDelft; and by UTIA. Four quality aspects have been discussed for each platform: performance improvements, power/energy efficiency, scalability, and usability. In the following, a number of these aspects are discussed for a couple of applications on the platforms to show the ALMARVI targets that have been achieved so far.

Nokia/TUT/UTURKU platform

With regard to performance, the processor core designed within the ALMARVI project (called LordCore) is based on the Transport Triggered Architecture (TTA) paradigm. To enable high-performance computation, it contains a 512-bit wide SIMD datapath, which is able to calculate 32 lanes of 16-bit wide half-precision floating-point values in parallel. The architecture is also scalable to other SIMD widths: 8- and 16-lane versions were also developed for lower performance usage such as LTE-M applications.

With regard to efficiency, the two-core version of the processor for the LTE receiver was synthesized with Synopsys Design Compiler version I-2013.12, using 28nm Fully Depleted Silicon-On-Insulator (FDSOI) process technology. The estimated power consumption for the two-core version is 137 mW (MMSE) and 163 mW (LORD), which is well below the targeted 1 W boundary. The extrapolated power consumption for a four-core version is about 270 mW, which would deliver the targeted LTE device category 11 performance of 600 Mbit/s. For the audio signal processor, the estimated power consumption of the two-core version is 300 µW, which is well below our targeted 1 mW upper boundary.

With regard to scalability, the architecture of the processor was designed to be scalable in various ways. Most notably, the memory architecture allows efficient scaling to multiple cores, as well as to multiple SIMD widths of the processor and numbers of functional units.

With regard to usability, the platform can be programmed using the OpenCL language for the LTE receiver. OpenCL allows using vector data types to execute the same code on many SIMD lanes of the processor, and also allows easy parallelization of the workload over multiple cores. The audio processor, on the other hand, can be programmed in the C language. C was selected due to the size of the application and the possibility for lower-level control than OpenCL.

TUDelft platform

By implementing a new VLIW instruction encoding, the performance of the rVEX processor has been increased by up to a factor of three while maintaining compatibility with the processor's dynamic parametrizability.

With regard to scalability, the rVEX processor is both run-time parametrizable and design-time configurable (using VHDL generics). Therefore, scalability is a metric in the design space when choosing the right parameters for the application.

On the usability level, the rVEX platform has matured immensely since the start of the ALMARVI project. In 2015, the core was redesigned with various improvements: 1) advanced debugging hardware/software, 2) hardware tracing functionality and performance counters, 3) core structure and pipeline organization are now (easily) design-time configurable, 4) the core is now (easily) design-time extensible, 5) the number of execution lanes is run-time parametrizable, and 6) a dynamic cache that supports the varying number of execution lanes of the core.

UTIA platform

With regard to performance, and to allow for real-time processing, the Full HD motion detection algorithm had to be accelerated approximately 60 times compared to the 666 MHz ARM processor, using the programmable logic part (PL) of the Xilinx ZYNQ device. The PL can implement IP cores clocked at 150 MHz.

UTIA performed design exploration in the Xilinx SDSoC 2015.4 environment. Two parallel chains of SDSoC-generated accelerators (at 150 MHz) deliver 57.09 FPS. This represents an acceleration of 48 times. The energy needed by the complete board to compute a Full HD motion detection frame was reduced 45 times.


6. References  

[1] T. Nylanden, J. Janhunen, O. Silven, and M. Juntti, "A GPU implementation for two MIMO-OFDM detectors," in Embedded Computer Systems (SAMOS), 2010 International Conference on, July 2010, pp. 293–300.

[2] M. Wu, B. Yin, and J. Cavallaro, "Flexible N-way MIMO detector on GPU," in Signal Processing Systems (SiPS), 2012 IEEE Workshop on, Oct 2012, pp. 318–323.

[3] M. Wu, Y. Sun, S. Gupta, and J. R. Cavallaro, "Implementation of a high throughput soft MIMO detector on GPU," J. Signal Process. Syst., vol. 64, no. 1, pp. 123–136, Jul. 2011. [Online]. Available: http://dx.doi.org/10.1007/s11265-010-0523-4

[4] A. Brandon, J. Hoozemans, J. van Straten, A. Lorenzon, A. Sartor, A. C. S. Beck, and S. Wong, "A sparse VLIW instruction encoding scheme compatible with generic binaries," in 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Dec 2015, pp. 1–7.