ENG6530 Reconfigurable Computing Systems

High Level Languages

“Electronic System Level (ESL) Design”

ENG6530 RCS 2

Topics

• Issues with Reconfigurable Computing
  – Complexity of applications …
  – Complexity of the design cycle
• Electronic System Level (ESL)
  – Motivation: why?
  – Advantages/disadvantages
  – Summary


Key Markets for HPC

How are we going to manage design complexity?

Managing Complexity

• One important practical approach to handling complexity is to raise the level of abstraction.
• We can take guidance from previous shifts in methodology that raised the level of abstraction:
  – from schematics to HDLs
  – from assembler code to HLLs

Complexity of the Design Cycle

Why do companies face these problems?

1. You can't get your hardware designs done quickly enough: designs are getting too complex to handle (SoC).
2. You don't have enough experienced hardware designers.
3. Errors in design or unimplemented features cost money.
4. ASICs and development tools are costly.
5. Software development stalls waiting for the hardware.

A Need for a New Design Language

Verilog and VHDL work very well for HW implementation flows, but …

1. They are too complicated for casual use.
2. Systems are becoming more complex, pushing us to design and verify at higher levels of abstraction.
3. Designers often implement today's systems as a mix of hardware and software (which parts should be HW and which SW?).
   – It is essential that new design flows support early software development, integration with existing C/C++ code, and HW/SW co-design.
   – Using a single language like C simplifies the migration task.
4. If we synthesize hardware from C-like languages, we can effectively turn every C programmer into a hardware designer.

High-level Synthesis

• Wouldn't it be nice to write high-level code?
  – Ratio of C to VHDL developers (10000:1 ?)
  + Easier to specify
  + Separates function from architecture
  + More portable
  - Hardware potentially slower
• Similar to the assembly code era
  – Programmers could always beat the compiler
  – But that is no longer the case
• Hopefully, high-level synthesis will catch up to manual effort

Abstraction: Advantages

Why Not a Software Language for Design Entry?

• The semantics of C and similar languages are distant from hardware (different execution models):
  – Software follows a sequential model.
  – Hardware is fundamentally concurrent.
• The C language has no support for user-specified parallelism, so either:
  – the synthesis tool must find the parallelism (a difficult task), or
  – the designer must use language extensions and insert explicit parallelism (the programmer has to think differently to design hardware).
• Techniques for synthesizing hardware from C either generate inefficient hardware or propose a language that merely adopts parts of C syntax.

Advantages of HLLs for Hardware Design

1. Designs are often specified by a C/C++ executable.
   – Some problems are better expressed as a software algorithm.
   – Software reference designs can be utilized.
2. Enables much higher-speed verification.
   – Faster simulation at the architecture level than at the gate level.
   – Reduces risk by enabling early verification of the entire system.
3. Software development techniques can be used.
4. Simplifies hardware-software partitioning.
5. Brings hardware and software teams closer together.

Requirements for a New Language?

• Don't invent a new language! Build on C/C++ so that:
  – The extensive C/C++ infrastructure (compilers, debuggers, language standards, books, etc.) can be reused.
  – Users' existing knowledge of C/C++ can be leveraged.
  – Integration with existing C/C++ code is easy.
• It must support specification and refinement to a detailed implementation of both software and hardware.
• It must support verification through all stages of the design process.
• It must provide a very general set of modeling constructs to cleanly support the wide range of abstraction levels and models of computation used in system design.

Semiconductor Design

[Figure: evolution of semiconductor design methodology by decade, with front-end and back-end tools]
• 1970s: hand crafted. Front end: in house. Back end: cut rubies (manual).
• 1980s: schematic capture. Front end: Daisy, Mentor, Valid. Back end: Calma, internal.
• 1990s: VHDL/Verilog. Front end: Synopsys, Cadence, Mentor. Back end: Dracula, Cadence, Avant!.
• 2000s: system-level design. Handel-C, SystemC, SystemVerilog, CatapultC, ImpulseC, AutoESL.

Ease of Use vs. Efficiency

[Figure: languages placed along an ease-of-use axis running from easy to difficult: Verilog, SystemC, Handel-C, SystemVerilog, CatapultC, ImpulseC, Vivado HLS.]

Contrasting ESLs

[Figure: ESL languages placed between a hardware end and a software end: HDL, Handel-C, VIVA, Mitrion C, Impulse-C, SystemC, Vivado HLS. Annotations: explicit par statements, memory statements, channels, …; pure C/C++ statements with pragmas inserted.]

C to FPGA Accelerated System

[Figure: C-based design flow. Stages: algorithm design, function & architecture, implementation. Blocks: specification model, testbench design, partitioning, system model, software model, C/C++ (AL), APIs/libraries, processor, BSP, C for HW (CA), design analysis/optimization, architecture exploration, mixed simulation, C-based synthesis, synthesis, P&R.]

Commercial RC Applications … using C-based design

Well established in embedded systems (defense & security, consumer, automotive & industrial):

• Digital video technology and image processing
  – "Processing at the sensor" versus local and/or remote processing
  – 3D LCD display development and test
  – Real-time verification of HDTV image processing algorithms
  – Robust image matching: product tracking and production line control
• Digital signal processing
  – Engine control unit for 3-phase motors
  – Radar and sonar beam forming and spatial filtering
  – Computer-aided tomography security system
• Communications and networking
  – Internet reconfigurable multimedia terminal, MP3, VoIP, etc.
  – Ground traffic simulation test bed for broadband satellite network communications
  – Satellite-based Internet data tracking system
• Rapid systems prototyping
  – Automotive safety system incorporating sensor fusion
  – Robotic vision system for object detection and robot guidance

Summary

• Systems today are too complicated to rely on hardware description languages such as VHDL or Verilog.
• New languages have emerged, such as SystemC, Handel-C, CatapultC, ImpulseC, …
• Some of these languages are:
  – suitable for system verification (they speed up simulation of the system)
  – suitable for synthesis
  – suitable for architecture exploration
  – suitable for hardware/software co-design
• Challenges:
  – Efficiency of synthesizers (performance, area, power)
  – Learning curve

Computing Systems

High Level Synthesis

CAD for FPGAs: Synthesis

[Figure: FPGA CAD flow with the blocks design entry, logic optimization, synthesis, mapping to k-LUT, packing LUTs to CLBs, placement, routing and configure an FPGA, with simulation alongside.]

FPGA Tool Flow with ESL

[Figure: tool flow with the blocks C/C++, Java, etc.; high-level synthesis; RT synthesis; technology mapping; physical design (placement, routing); netlist; bitfile; processor; FPGA.]

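High Level Synthesis

[Figure: high-level synthesis takes an algorithm, constraints and a library, and produces a datapath (with latches, operating on A, B, C, D, E) and a controller.]

Constraints: area; time (clock period, number of clock steps); power.

Algorithm:

  WHILE G < K LOOP
    F := E*(A+B);
    G := (A+B)*(C+D);
  END LOOP;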

High-level Synthesis

• First, consider how to manually convert high-level code into a circuit
• Steps
  – 1) Build FSM for controller
  – 2) Build datapath based on FSM

  acc = 0;
  for (i=0; i < 128; i++)
    acc += a[i];

Manual Example

• Build an FSM (controller)
  – Decompose code into states

  acc = 0;
  for (i=0; i < 128; i++)
    acc += a[i];

[Figure: FSM with the states "acc=0, i=0", "if (i < 128)", "load a[i]" and "acc += a[i]".]

Manual Example

• Build a datapath
  – Allocate resources for each state
  – Determine register inputs
  – Add outputs
  – Add control signals

[Figure: the datapath is built up step by step next to the FSM for the same code. Labels: addr, a[i], acc, the constants 1, 128 and 1, "In from memory", "Memory address".]

Manual Example

• Combine controller + datapath

[Figure: controller connected to the datapath. Signals: in from memory, memory address, acc, done, memory read.]
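The manual example can be mimicked in software as a rough, cycle-by-cycle model of the controller and datapath. The sketch below is only an illustration under stated assumptions: the state names, the `data` register, the function name `accumulate_fsm`, and the choice to increment `i` in the same state as the accumulation are not taken from the slides.

```cpp
#include <cstdint>

// Controller states, one per FSM state in the manual example (names assumed).
enum class State { Init, Check, Load, Add, Done };

// Software model of the controller + datapath for:
//   acc = 0; for (i = 0; i < 128; i++) acc += a[i];
// Returns the final acc; *cycles (if non-null) receives the number of
// states visited, i.e. the cycle count of this simple one-state-per-cycle model.
int32_t accumulate_fsm(const int32_t a[128], int* cycles) {
    State state = State::Init;
    int32_t acc = 0;   // accumulator register
    int32_t i = 0;     // loop counter register
    int32_t data = 0;  // register holding the value read from memory (assumed)
    int n = 0;

    while (state != State::Done) {
        ++n;
        switch (state) {
        case State::Init:  acc = 0; i = 0;                                 state = State::Check; break;
        case State::Check: state = (i < 128) ? State::Load : State::Done;  break;
        case State::Load:  data = a[i];                                    state = State::Add;   break;
        case State::Add:   acc += data; ++i;                               state = State::Check; break;
        case State::Done:  break;
        }
    }
    if (cycles) *cycles = n;
    return acc;
}
```

In this model each loop iteration costs three states (check, load, add), which is the kind of latency a scheduler would try to reduce by overlapping or merging states.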

Manual Example

• Comparison with high-level synthesis
  – Determining when to perform each operation
    • => Scheduling
  – Allocating resources for each operation
    • => Resource allocation
  – Mapping operations onto resources
    • => Binding

Behavioral Synthesis

[Figure: behavioral synthesis takes an algorithm, I/O behavior and a target library and produces an RTL design through resource allocation, scheduling and binding. Logic synthesis (the classic RTL design flow) then produces a gate-level netlist.]

HLS: Main Steps

[Figure: high-level code → syntactic analysis → intermediate representation → optimization → scheduling/resource allocation → binding/resource sharing → controller + datapath. The flow is divided into a front end and a back end.]

• Syntactic analysis: converts code to an intermediate representation, which allows all following steps to use a language-independent format.
• Scheduling/resource allocation: determines when each operation will execute and which resources are used.
• Binding/resource sharing: maps operations onto physical resources.

Intermediate Representation

• Parser converts tokens to intermediate representation
  – Usually, an abstract syntax tree

  x = 0;
  if (y < z)
    x = 1;
  d = 6;

[Figure: abstract syntax tree for the code above, with assign nodes, a cond node for "y < z", and the leaves x, 1, d and 6.]

Intermediate Representation

• Why use intermediate representation?
  – Easier to analyze/optimize than source code
  – Theoretically can be used for all languages
    • Makes synthesis back end language independent

[Figure: C code → syntactic analysis → intermediate representation → back end.]

Back end: scheduling, resource allocation and binding are independent of the source language; sometimes optimizations are too.

Scheduling

• Scheduling assigns a start time to each operation in the data-flow graph (DFG)
  – Start times must not violate dependencies in the DFG
  – Start times must meet performance constraints
    • Alternatively, resource constraints
• Performed on the DFG of each control-flow graph (CFG) node
  – => Can't execute multiple CFG nodes in parallel

Scheduling Examples

[Figure: three alternative schedules of the same operations on inputs a, b, c, d, using three, three and two cycles respectively.]

Scheduling Problems

• Several types of scheduling problems
  – Usually some combination of performance and resource constraints
• Problems:
  1. Unconstrained
     – Not very useful, every schedule is valid
  2. Minimum latency
  3. Latency constrained
  4. Minimum-latency, resource constrained
     • i.e. find the schedule with the shortest latency that uses no more than a specified number of resources
     • NP-Complete
  5. Minimum-resource, latency constrained
     • i.e. find the schedule that meets the latency constraint (which may be anything) and uses the minimum number of resources
     • NP-Complete

Minimum Latency Scheduling

• ASAP (as soon as possible) algorithm
  – Find a candidate node
    • Candidate is a node whose predecessors have been scheduled and completed (or that has no predecessors)
  – Schedule node one cycle later than max cycle of its predecessors
  – Repeat until all nodes scheduled

[Figure: ASAP schedule of a DFG with inputs a to h over cycles 1 to 4.]

Minimum possible latency: 4 cycles
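The ASAP steps above can be sketched in a few lines. This is a minimal illustration, assuming that every operation takes one cycle and that the DFG nodes are numbered in topological order, so the "find a candidate" step reduces to visiting nodes in index order. The names `asap_schedule`, `schedule_latency` and `preds` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// preds[v] lists the predecessors of operation v in the DFG.
// Assumption: every predecessor has a smaller index than its successor.
// Returns the 1-based start cycle of each operation.
std::vector<int> asap_schedule(const std::vector<std::vector<int>>& preds) {
    std::vector<int> cycle(preds.size(), 0);
    for (std::size_t v = 0; v < preds.size(); ++v) {
        int latest_pred = 0;                 // 0 means "no predecessors"
        for (int p : preds[v])
            latest_pred = std::max(latest_pred, cycle[p]);
        cycle[v] = latest_pred + 1;          // one cycle after the latest predecessor
    }
    return cycle;
}

// The latency of a schedule is its largest start cycle.
int schedule_latency(const std::vector<int>& cycle) {
    return cycle.empty() ? 0 : *std::max_element(cycle.begin(), cycle.end());
}
```

For a small graph where nodes 0 and 1 feed node 2, node 2 feeds node 3, and nodes 3 and 4 feed node 5, the start cycles come out as 1, 1, 2, 3, 1, 4, so the minimum latency is 4 cycles.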

• ALAP (as late as possible) algorithm
  – Run ASAP, get minimum latency L
  – Find a candidate
    • Candidate is a node whose successors are scheduled (or that has none)
  – Schedule node one cycle before min cycle of its successors
    • Nodes with no successors are scheduled to cycle L
  – Repeat until all nodes scheduled

[Figure: ALAP schedule of the same DFG (inputs a to h) over cycles 1 to 4.]

L = 4 cycles

• ALAP
  – Has to run ASAP first, seems pointless
  – But many heuristics need the mobility/slack of each operation
    • ASAP gives the earliest possible time for an operation
    • ALAP gives the latest possible time for an operation
  – Slack = difference between earliest and latest possible schedule
    • Slack = 0 implies operation has to be done in the current scheduled cycle
    • The larger the slack, the more options a heuristic has to schedule the operation
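To make the slack idea concrete, here is a small sketch that runs an ASAP pass, then an ALAP pass bounded by the ASAP latency L, and reports the slack of every operation. As before, it assumes single-cycle operations and a topologically numbered DFG; the `Mobility` struct and `compute_mobility` are illustrative names, not part of any tool.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Mobility {
    std::vector<int> asap;   // earliest start cycle (1-based)
    std::vector<int> alap;   // latest start cycle (1-based)
    std::vector<int> slack;  // alap - asap
};

// preds[v] lists the predecessors of v. Assumption: predecessors always
// have smaller indices than their successors.
Mobility compute_mobility(const std::vector<std::vector<int>>& preds) {
    const std::size_t n = preds.size();
    Mobility m{std::vector<int>(n, 1), std::vector<int>(n, 0), std::vector<int>(n, 0)};

    // ASAP pass: one cycle after the latest predecessor.
    int L = 0;
    for (std::size_t v = 0; v < n; ++v) {
        for (int p : preds[v]) m.asap[v] = std::max(m.asap[v], m.asap[p] + 1);
        L = std::max(L, m.asap[v]);
    }

    // ALAP pass: start every node at cycle L, then walk backwards and pull
    // each predecessor to at least one cycle before its earliest successor.
    std::fill(m.alap.begin(), m.alap.end(), L);
    for (std::size_t v = n; v-- > 0;)
        for (int p : preds[v]) m.alap[p] = std::min(m.alap[p], m.alap[v] - 1);

    for (std::size_t v = 0; v < n; ++v) m.slack[v] = m.alap[v] - m.asap[v];
    return m;
}
```

Operations on the critical path end up with slack 0; an operation off the critical path gets a positive slack, which is the freedom a resource-constrained heuristic can exploit.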

Binding

• During scheduling, we determined:
  – When ops will execute
  – How many resources are needed
• We still need to decide which ops execute on which resources
  – => Binding
  – If multiple ops use the same resource
    • => Resource sharing

Binding

• Basic idea: map operations onto resources such that operations in the same cycle don't use the same resource

[Figure: a schedule over cycles 1 to 4 bound to 2 ALUs (+/-) and 2 multipliers: Mult1, ALU1, ALU2, Mult2.]

Binding

• Many possibilities
  – Bad binding may increase resources, require huge steering logic, reduce clock frequency, etc.

[Figure: an alternative binding of the same schedule (cycles 1 to 4) to 2 ALUs (+/-) and 2 multipliers: Mult1, ALU1, ALU2, Mult2.]
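A very simple way to respect the basic binding rule is a greedy pass over the scheduled operations. The sketch below is one illustrative possibility, not the method of any particular tool: it ignores steering-logic cost entirely, and the `Op` struct, `OpType` enum and function names are assumptions made for the example.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

enum class OpType { Alu, Mult };

struct Op {
    OpType type;
    int cycle;   // cycle assigned by the scheduler
};

// Returns, for each operation, the index of the unit of its type that it is
// bound to. Operations in the same cycle never share a unit; operations in
// different cycles reuse the lowest-numbered unit (resource sharing).
std::vector<int> bind_ops(const std::vector<Op>& ops) {
    std::map<std::pair<OpType, int>, int> used;   // (type, cycle) -> units taken so far
    std::vector<int> unit(ops.size());
    for (std::size_t k = 0; k < ops.size(); ++k)
        unit[k] = used[{ops[k].type, ops[k].cycle}]++;
    return unit;
}

// Number of units of a given type that the binding needs.
int units_needed(const std::vector<Op>& ops, const std::vector<int>& unit, OpType t) {
    int count = 0;
    for (std::size_t k = 0; k < ops.size(); ++k)
        if (ops[k].type == t && unit[k] + 1 > count) count = unit[k] + 1;
    return count;
}
```

With two multiplications in cycle 1, two ALU operations in cycle 2, one multiplication in cycle 3 and one ALU operation in cycle 4, this binding needs 2 multipliers and 2 ALUs. A production binder would also weigh the multiplexers that sharing introduces.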

Computing Systems

Xilinx Vivado High Level Synthesis (HLS), or AutoESL

High-Level Synthesis: HLS

• High-Level Synthesis
  – Creates an RTL implementation from C-level source code
  – Extracts control and dataflow from the source code
  – Implements the design based on defaults and user-applied directives
• Many implementations are possible from the same source description
  – Smaller designs, faster designs, optimal designs
  – Enables manual design exploration

AutoESL or Vivado HLS

[Figure: AutoESL tool flow. Inputs: C, C++ or SystemC source, a test bench, constraints/directives, and a script with constraints. Outputs: RTL (VHDL, Verilog, SystemC) and an RTL wrapper, which feed RTL simulation and RTL synthesis.]

Vivado HLS GUI Toolbar

• The primary commands have toolbar buttons
  – Easy access for standard tasks
  – Button highlights when the option is available
    • E.g. cannot perform C/RTL simulation before synthesis

[Figure: toolbar buttons: Create a new Project, Create a new Solution, Change Project Settings, Change Solution Settings, Run C Simulation, Run C Synthesis, Run C/RTL Cosimulation, Export RTL, Open Reports, Compare Reports, Open Analysis Viewer.]

Design Exploration with Directives

One body of code: many hardware outcomes

  …
  loop: for (i=3;i>=0;i--) {
    if (i==0) {
      acc+=x*c[0];
      shift_reg[0]=x;
    } else {
      shift_reg[i]=shift_reg[i-1];
      acc+=shift_reg[i]*c[i];
    }
  }
  …

• The same hardware is used for each iteration of the loop:
  – Small area
  – Long latency
  – Low throughput
• Different hardware is used for each iteration of the loop:
  – Higher area
  – Short latency
  – Better throughput
• Different iterations are executed concurrently:
  – Higher area
  – Short latency
  – Best throughput

Before we get into details, let's look under the hood …
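As a rough illustration of how one body of code yields different hardware, the FIR loop can be annotated with a Vivado HLS loop directive. This sketch assumes `int` for `data_t`, `coef_t` and `acc_t`, which the slides do not define. With no directive the loop is expected to stay rolled (the same hardware reused each iteration); `#pragma HLS UNROLL` asks for separate hardware per iteration; replacing it with `#pragma HLS PIPELINE` asks for overlapped iterations. An ordinary C++ compiler ignores the pragma, so the function still simulates as plain software.

```cpp
// Assumed types for illustration only; the slides leave them undefined.
typedef int data_t;
typedef int coef_t;
typedef int acc_t;

void fir(data_t* y, coef_t c[4], data_t x) {
    static data_t shift_reg[4];
    acc_t acc = 0;
    loop: for (int i = 3; i >= 0; i--) {
#pragma HLS UNROLL  // swap for "#pragma HLS PIPELINE" to overlap iterations instead
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}
```

Feeding an impulse (1 followed by zeros) through this function returns the coefficients one per call, which is a quick functional check that is independent of whichever directive is chosen.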

Analysis Perspective

• Perspective for design analysis
  – Allows interactive analysis

Introduction to High-Level Synthesis: Hardware Extraction

• How is hardware extracted from C code?
  – Control and datapath can be extracted from C code at the top level
  – The same principles used in the example can be applied to sub-functions
    • At some point in the top-level control flow, control is passed to a sub-function
    • A sub-function may be implemented to execute concurrently with the top level and/or other sub-functions
• How is this control and dataflow turned into a hardware design?
  – AutoESL maps this to hardware through scheduling and binding processes
• How is my design created?
  – How are functions, loops, arrays and IO ports mapped?

HLS: Control Extraction

From any C code example …

  void fir (
    data_t *y,
    coef_t c[4],
    data_t x
    ) {

    static data_t shift_reg[4];
    acc_t acc;
    int i;

    acc=0;
    loop: for (i=3;i>=0;i--) {
      if (i==0) {
        acc+=x*c[0];
        shift_reg[0]=x;
      } else {
        shift_reg[i]=shift_reg[i-1];
        acc+=shift_reg[i]*c[i];
      }
    }
    *y=acc;
  }

The loops in the C code correlate to states of behavior.

[Figure: control behavior as finite state machine (FSM) states: function start, for-loop start, for-loop end, function end.]

This behavior is extracted into a hardware state machine.

HLS: Control & Datapath Extraction

From any C code example …

• The control is known: control behavior as finite state machine (FSM) states.
• Operations are extracted …
• A unified control dataflow behavior is created.

[Figure: control behavior, the extracted operations (including the reads RDx and RDc), and the combined control and datapath behavior.]

High-Level Synthesis: Scheduling & Binding

• Scheduling & Binding
  – Scheduling and binding are at the heart of HLS
• Scheduling determines in which clock cycle an operation will occur
  – Takes into account the control, dataflow and user directives
  – The allocation of resources can be constrained
• Binding determines which library cell is used for each operation
  – Takes into account component delays, user directives

[Figure: design source (C, C++, SystemC) → scheduling → binding → RTL (Verilog, VHDL, SystemC), guided by the technology library and user directives.]

Scheduling

• The operations in the control flow graph are mapped into clock cycles
• The technology and user constraints impact the schedule
  – A faster technology (or slower clock) may allow more operations to occur in the same clock cycle
• The code also impacts the schedule
  – Code implications and data dependencies must be obeyed

  void foo ( …
    t1 = a * b;
    t2 = c + t1;
    t3 = d * t2;
    out = t3 - e;
  }

[Figure: two alternative schedules (Schedule 1 and Schedule 2) of the operations *, +, * and -, which spread the operations across clock cycles differently.]
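The effect of the clock period on the schedule can be illustrated with a toy model of operator chaining. The sketch below packs a linear chain of dependent operations (like `t1`, `t2`, `t3`, `out` in `foo`) into as few cycles as the clock period allows. The delay values used in the usage note are made-up numbers, and `cycles_for_chain` is a hypothetical helper, not an HLS tool function.

```cpp
#include <vector>

// delay_ps[k] is the combinational delay of the k-th operation in a chain
// where each operation depends on the previous one. Returns how many clock
// cycles the chain needs if operations are chained within a cycle whenever
// their summed delay still fits in clock_ps.
// Assumption: no single operation is slower than the clock period.
int cycles_for_chain(const std::vector<int>& delay_ps, int clock_ps) {
    if (delay_ps.empty()) return 0;
    int cycles = 1;
    int used = 0;                        // delay already accumulated in this cycle
    for (int d : delay_ps) {
        if (used > 0 && used + d > clock_ps) {
            ++cycles;                    // register the value and start a new cycle
            used = 0;
        }
        used += d;
    }
    return cycles;
}
```

With assumed delays of 3000, 2000, 3000 and 2000 ps for the multiply, add, multiply and subtract, a 10000 ps clock fits the whole chain in one cycle, a 5000 ps clock needs two cycles, and a 3000 ps clock needs four.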

Binding

• Binding is where operations are mapped to cores from the hardware library
  – Operators map to cores
• Binding Decision #1
  – Given this schedule: [Figure: schedule with both multiplications in the same cycle]
    • Binding must use 2 multipliers, since both are in the same cycle
    • It can decide to use an adder and a subtractor, or one addsub
• Binding Decision #2
  – Given this schedule: [Figure: schedule with the multiplications in different cycles]
    • Binding may decide to share the multipliers (each is used in a different cycle)
• What affects the decision made by the scheduler/binder?
  – Timing and availability of resources
  – Binding may decide the cost of sharing (muxing) would impact timing, and it may decide not to share them
  – Binding may make this same decision in the first example above too

Understanding AutoESL Synthesis

• HLS
  – AutoESL determines in which cycle operations should occur (scheduling)
  – Determines which hardware units to use for each operation (binding)
  – It performs HLS by:
    I. Obeying built-in defaults
    II. Obeying user directives & constraints to override defaults
    III. Calculating delays and area using the specified technology/device
• Understand AutoESL defaults
  – Key to understanding the initial design created by AutoESL
• Understand the priority of directives
  1. Meet performance (clock & throughput)
     • AutoESL will allow a local clock path to fail if this is required to meet throughput
     • Often the timing can still be met after logic synthesis
  2. Then minimize latency
  3. Then minimize area

C, C++ and SystemC Support

• The vast majority of C, C++ and SystemC is supported
  – Provided it is statically defined at compile time
  – If it's not defined until run time, it won't be synthesizable
• Any of the three variants of C can be used
  – If C is used, Vivado HLS expects the file extension to be .c
  – For C++ and SystemC it expects the file extension .cpp

Unsupported Constructs: Overview

• System calls and function pointers
  – Dynamic memory allocation
    • malloc() & free()
  – Standard I/O and file I/O operations
    • fprintf() / fscanf() etc.
  – System calls
    • time(), sleep() etc.
• Data types
  – Forward-declared types
  – Recursive type definitions
    • Type contains members with the same type
• Non-standard pointers
  – Pointer casting between general data types
    • OK with native integer types
  – If a double pointer is used in multiple functions, Vivado HLS will inline all the functions
    • Slower synthesis, may increase area & run time

The Key Attributes of C Code

  void fir (
    data_t *y,
    coef_t c[4],
    data_t x
    ) {

    static data_t shift_reg[4];
    acc_t acc;
    int i;

    acc=0;
    loop: for (i=3;i>=0;i--) {
      if (i==0) {
        acc+=x*c[0];
        shift_reg[0]=x;
      } else {
        shift_reg[i]=shift_reg[i-1];
        acc+=shift_reg[i] * c[i];
      }
    }
    *y=acc;
  }

• Functions: all code is made up of functions, which represent the design hierarchy: the same in hardware.
• Top-level IO: the arguments of the top-level function determine the hardware RTL interface ports.
• Types: all variables are of a defined type. The type can influence the area and performance.
• Loops: functions typically contain loops. How these are handled can have a major impact on area and performance.
• Arrays: arrays are used often in C code. They can influence the device IO and become performance bottlenecks.
• Operators: operators in the C code may require sharing to control area, or specific hardware implementations to meet performance.

Let's examine the default synthesis behavior of these …

Types = Operator Bit-Sizes

From any C code example, operations are extracted … The C types define the size of the hardware used: handled automatically.

Standard C types:
• long long (64-bit)
• int (32-bit)
• short (16-bit)
• char (8-bit)
• double (64-bit)
• float (32-bit)
• unsigned types

For floats and doubles there must be an FP core in the library that binding can map to; otherwise the design cannot be synthesized.

Arbitrary precision types:
• C: ap(u)int types (1-1024)
• C++: ap_(u)int types (1-1024), ap_fixed types
• C++/SystemC: sc_(u)int types (1-1024), sc_fixed types

These can be used to define any variable to be a specific bit-width (e.g. 17-bit, 47-bit, etc.).

Why Is Arbitrary Precision Needed?

• Code using native C int type
• However, if the inputs will only have a max range of 8 bits
  – Arbitrary precision data types should be used
  – It will result in smaller & faster hardware with the full required precision
  – With arbitrary precision types on function interfaces, Vivado HLS can propagate the correct bit-widths throughout the design

HLS & C Types

• There are 4 basic types you can use for HLS
  – Standard C/C++ types
  – Vivado HLS enhancements to C: apint
  – Vivado HLS enhancements to C++: ap_int, ap_fixed
  – SystemC types

Arbitrary Precision: C apint Types

• For C
  – Vivado HLS types apint can be used
  – Range: 1 to 1024 bits
  – Specify the integers as shown and just use them like any other variable

  #include "ap_cint.h"    // include header file

  void foo_top (…) {
    int9 var1;      // 9-bit
    uint10 var2;    // 10-bit unsigned

• There are two issues to be aware of
  – C compilation: you MUST use apcc to simulate (no debugger support)
    • Failure to use apcc to compile the C will result in INCORRECT results
    • This only applies to C, NOT C++ or SystemC
  – Be aware of integer promotion issues

Using apcc

• apcc
  – Command line compatible with gcc
  – Required to support arbitrary precision for C
  – Use apcc at the Vivado HLS CLI (shell)

  shell> apcc -o my_test test.c test_tb.c

• apcc understands bit-accurate types
  – Once you create bit-accurate types you must re-validate the C
  – It's the only way to discover rounding and truncation issues
    • It's fast in C!

  #include "ap_cint.h"
  int3 ex_bit_accurate (
    int3 x1,
    int3 y1
    ) {
    return x1+y1;
  }

Given: x1=2, y1=2

[Figure: bit-level comparison of the two simulations. The apcc simulation behaves as the hardware does: 010 + 010 returns the 3-bit value 100. The gcc simulation carries out the addition at full native integer width.]
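The difference between the two simulations in the `int3` example can be reproduced without the Vivado headers by masking to three bits by hand. This is only an emulation written for illustration: `wrap_int3`, `add_native` and `add_int3` are hypothetical helpers and not part of `ap_cint.h`.

```cpp
// Interpret the low 3 bits of v as a signed two's-complement value (-4..3).
// This mimics what a 3-bit signed type keeps after truncation.
int wrap_int3(int v) {
    int low = v & 0x7;                    // keep bits [2:0]
    return (low & 0x4) ? low - 8 : low;   // bit 2 is the sign bit
}

// What a native-int simulation computes: no truncation.
int add_native(int x1, int y1) { return x1 + y1; }

// What a bit-accurate 3-bit simulation computes: inputs and result
// are all truncated to 3 bits.
int add_int3(int x1, int y1) {
    return wrap_int3(wrap_int3(x1) + wrap_int3(y1));
}
```

For x1 = 2 and y1 = 2 the native addition gives 4, while the 3-bit result is the bit pattern 100, which reads as -4 in signed two's complement. This is the kind of discrepancy that re-validating the C with bit-accurate types is meant to expose.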

Arbitrary Precision: C++ ap_fixed Types

• Support for fixed-point datatypes in C++
  – Include the path to the ap_fixed.h header file ($VIVADO_HLS_HOME/include/ap_fixed.h)
  – Both signed (ap_fixed) and unsigned (ap_ufixed) types
• Advantages of fixed-point types
  – The result of operations on variables with different sizes is automatically taken care of
  – The binary point is automatically aligned
    • Quantization: underflow is automatically handled
    • Overflow: saturation is automatically handled
  – Alternatively, make the result variable large enough such that overflow or underflow does not occur

  #include "ap_fixed.h"

  void foo_top (…) {
    ap_fixed<9, 5, AP_RND_CONV, AP_SAT> var1;    // 9-bit, 5 integer bits, 4 fractional bits
    ap_ufixed<10, 7, AP_RND_CONV, AP_SAT> var2;  // 10-bit unsigned, 7 integer bits, 3 fractional bits

• Fixed point types are specified by– Total bit width (W)

– The number of integer bits (I)

– The quantization/rounding mode (Q)

– The overflow/saturation mode (O)

– The number of saturation bits

Definition of ap_fixed type

Identifier  Description
W   Word length in bits
I   The number of bits used to represent the integer value (the number of bits above the binary point)
Q   Quantization mode: dictates the behavior when greater precision is generated than can be defined by the LSBs
      AP_RND          Rounding to plus infinity
      AP_RND_ZERO     Rounding to zero
      AP_RND_MIN_INF  Rounding to minus infinity
      AP_RND_INF      Rounding to infinity
      AP_RND_CONV     Convergent rounding
      AP_TRN          Truncation to minus infinity (default)
      AP_TRN_ZERO     Truncation to zero
O   Overflow mode: dictates the behavior when more bits are required than the word contains
      AP_SAT          Saturation
      AP_SAT_ZERO     Saturation to zero
      AP_SAT_SYM      Symmetrical saturation
      AP_WRAP         Wrap around (default)
      AP_WRAP_SM      Sign magnitude wrap around
N   The number of saturation bits in wrap modes

Binary point : W = I + B

ap_[u]fixed<W, I, Q, O, N>

Bit positions: I-1 … 1 0 (binary point) -1 … -B
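To make the W and I parameters concrete, here is a minimal plain-C sketch that emulates a signed fixed-point value. It does not use ap_fixed.h, and the helper names quantize_sat and fixed_to_double are hypothetical. It models only truncation toward minus infinity (similar to AP_TRN) and saturation (similar to AP_SAT), under the assumption that W is smaller than the width of long.

```c
/* Hypothetical helper: convert a real value to a signed fixed-point code
 * with w total bits and i integer bits (so b = w - i fractional bits).
 * Quantization: truncation toward minus infinity.
 * Overflow: saturation to the largest or smallest representable code. */
long quantize_sat(double value, int w, int i)
{
    int b = w - i;                           /* number of fractional bits */
    long max_code = (1L << (w - 1)) - 1;     /* largest signed w-bit code */
    long min_code = -(1L << (w - 1));        /* smallest signed w-bit code */
    double scaled = value * (double)(1L << b);
    long code;

    if (scaled >= (double)max_code + 1.0) return max_code;  /* saturate high */
    if (scaled < (double)min_code) return min_code;         /* saturate low */

    code = (long)scaled;                     /* cast truncates toward zero */
    if ((double)code > scaled) code--;       /* adjust negatives down to floor */
    return code;
}

/* Hypothetical helper: convert a fixed-point code back to a real value. */
double fixed_to_double(long code, int w, int i)
{
    return (double)code / (double)(1L << (w - i));
}
```

With W = 9 and I = 5 (the var1 format above), 3.3 is stored as code 52, which reads back as 3.25, and any input above the range saturates to 15.9375.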


Definition of ap_fixed type

• Synthesis for floating point
– Data types (IEEE-754 standard compliant)
• Single-precision 32 bit: 24-bit significand (including the implicit leading bit), 8-bit exponent
• Double-precision 64 bit: 53-bit significand (including the implicit leading bit), 11-bit exponent

• Support for Operators
– Vivado HLS supports the Floating Point (FP) cores for each Xilinx technology
• If Xilinx has a FP core, Vivado HLS supports it
• It will automatically be synthesized
– If there is no such FP core in the Xilinx technology, it will not be in the library
• The design will still be synthesized

Floating Point Support


Floating Point Cores

• Vivado HLS provides support for many math functions
– Even if no floating-point core exists
– These functions are implemented in a bit-approximate manner
– The results may differ within a few Units of Least Precision (ULP) from the C/C++ standards

• If you use math.h (C) or cmath (C++)
– The functions will be synthesized automatically
– The C simulation results may differ from the RTL simulation results
– Use a test bench which checks for ranges: not == or !=

• If you replace math.h or cmath with the Vivado HLS header file “hls_math.h”, or keep math/cmath and “add_files hls_lib.c”
– The C simulation will match the RTL simulation
– The C simulation may differ from the C simulation using math/cmath (or math/cmath without hls_lib.c)
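The advice to check ranges rather than exact equality can be sketched as follows. The helper name within_tolerance and the choice of an absolute tolerance are assumptions for illustration; a real test bench would pick a tolerance that suits its data.

```c
/* Hypothetical test bench helper: returns 1 when actual lies within
 * +/- tol of expected, 0 otherwise. Use this instead of == or != when
 * comparing floating-point results from C simulation and RTL simulation. */
int within_tolerance(double actual, double expected, double tol)
{
    double diff = actual - expected;
    if (diff < 0.0) {
        diff = -diff;   /* absolute difference without needing math.h */
    }
    return diff <= tol;
}
```

A test bench would then report a failure (non-zero return) only when within_tolerance() returns 0 for some output sample.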

Support for Math Functions

More details are available in the Coding Style Guide chapter in the User Guide

Loops

• By default, loops are rolled
– Each C loop iteration is implemented in the same state
– Each C loop iteration is implemented with the same resources
– IMPORTANT: Loops can be unrolled if their indices are statically determinable at elaboration time
• Not when the number of iterations is variable
– Unrolled loops result in more elements to schedule but greater operator mobility
• Let’s look at an example ….

void foo_top (…) {
  ...
  Add: for (i=3;i>=0;i--) {
    b = a[i] + b;
    ...
  }

[Figure: synthesis of foo_top into a block with a single adder (+)]

Loops require labels if they are to be referenced by Tcl directives (the GUI will auto-add labels)
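As a sketch of what unrolling means for the “Add” loop, the two C functions below compute the same result. The function names and the fixed trip count of 4 are assumptions for illustration; in the tool the unrolling is requested with a directive rather than written by hand.

```c
/* Rolled form: one addition per iteration, four iterations. */
int sum_rolled(const int a[4], int b)
{
    int i;
    for (i = 3; i >= 0; i--) {
        b = a[i] + b;
    }
    return b;
}

/* Hand-unrolled form: the loop counter and exit test are gone, and the
 * four additions are now separate operations that a scheduler could
 * rearrange or place in parallel where dependencies allow. */
int sum_unrolled(const int a[4], int b)
{
    b = a[3] + b;
    b = a[2] + b;
    b = a[1] + b;
    b = a[0] + b;
    return b;
}
```

Unrolling is only possible here because the trip count (4) is known at compile time.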


Data Dependencies: Good

• Example of good mobility
– The read on data port X can occur anywhere from the start to iteration 4
• The only constraint on RDx is that it occurs before the final multiplication
– AutoESL has a lot of freedom with this operation
• It waits until the read is required, saving a register
• There are no advantages to reading any earlier (unless you want it registered)
• Input reads can be optionally registered
– The final multiplication is very constrained…

void fir (…) {
  …
  acc=0;
  loop: for (i=3;i>=0;i--) {
    if (i==0) {
      acc+=x*c[0];
      shift_reg[0]=x;
    } else {
      shift_reg[i]=shift_reg[i-1];
      acc+=shift_reg[i]*c[i];
    }
  }
  *y=acc;
}

[Figure: default schedule across iterations 1 to 4, showing the RDc reads, the loop compare (>=) and decrement (-), the multiply (*) and the WRy write. The read X operation has good mobility.]

Data Dependencies: Bad

• Example of bad mobility
– The final multiplication must occur after the read of X and before the final addition
• It could occur in the same cycle if timing allows
– Loops are rolled by default
• Each iteration cannot start until the previous iteration completes
• The final multiplication (in iteration 4) must wait for earlier iterations to complete
– The structure of the code is forcing a particular schedule
• There is little mobility for most operations
– Optimizations allow loops to be unrolled, giving greater freedom

[Figure: default schedule across iterations 1 to 4, showing the RDc reads, the loop compare (>=) and decrement (-), the multiply (*) and the WRy write. The multiplication is very constrained.]

Schedule after Loop Optimization

• With the loop unrolled (completely)
– The dependency on loop iterations is gone
– Operations can now occur in parallel
• If data dependencies allow
• If operator timing allows
– Design finishes faster but uses more operators
• 2 multipliers & 2 adders

• Schedule Summary
– All the logic associated with the loop counters and index checking is now gone
– Two multiplications can occur at the same time
• All 4 could, but it’s limited by the number of input reads (2) on coefficient port C
– Why 2 reads on port C?
• The default behavior for arrays now limits the schedule…
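The effect of complete unrolling on the FIR loop can be sketched in plain C. The fir() parameter list is elided on the slides, so the signatures, the int data type and the TAPS constant below are assumptions; fir_rolled follows the loop shown earlier and fir_unrolled is a hand-written equivalent.

```c
#define TAPS 4

/* Rolled reference, following the loop on the slides.
 * shift_reg is static so the delay line persists between calls. */
void fir_rolled(int *y, const int c[TAPS], int x)
{
    static int shift_reg[TAPS];
    int acc = 0;
    int i;
    for (i = TAPS - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

/* Hand-unrolled equivalent: no loop counter or index check remains, and
 * the four multiplications are independent of one another, so only the
 * coefficient reads and the adder chain limit how many run in parallel. */
void fir_unrolled(int *y, const int c[TAPS], int x)
{
    static int shift_reg[TAPS];
    shift_reg[3] = shift_reg[2];
    shift_reg[2] = shift_reg[1];
    shift_reg[1] = shift_reg[0];
    shift_reg[0] = x;
    *y = shift_reg[3] * c[3] + shift_reg[2] * c[2]
       + shift_reg[1] * c[1] + x * c[0];
}
```

Both functions keep their own delay line, so feeding them the same input sequence should produce the same output sequence.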


Arrays in HLS

• An array in C code is implemented by a memory in the RTL
– By default, arrays are implemented as RAMs, optionally a FIFO

• The array can be targeted to any memory resource in the library
– The ports (Address, CE active high, etc.) and sequential operation (clocks from address to data out) are defined by the library model
– All RAMs are listed in the AutoESL Library Guide

• Arrays can be merged with other arrays and reconfigured
– To implement them in the same memory or one of different widths & sizes

• Arrays can be partitioned into individual elements
– Implemented as smaller RAMs or registers

void foo_top(int x, …) {
  int A[N];
  L1: for (i = 0; i < N; i++)
    A[i+x] = A[i] + i;
}

[Figure: synthesis of foo_top. Array A (elements 0 … N-2, N-1) becomes a single-port RAM (SPRAMB) with ADDR, DIN and DOUT ports, connected through A_in and A_out.]

Top-Level IO Ports

• Top-level function arguments
– All top-level function arguments have a default hardware port type

• When the array is an argument of the top-level function
– The array/RAM is “off-chip”
– The type of memory resource determines the top-level IO ports
– Arrays on the interface can be mapped & partitioned
• E.g. partitioned into separate ports for each element in the array

• Default RAM resource
– Dual port RAM if performance can be improved, otherwise Single Port RAM

void foo_top(int A[3*N], int x) {
  L1: for (i = 0; i < N; i++)
    A[i+x] = A[i] + i;
}

[Figure: synthesis of foo_top with array A mapped to an external dual-port RAM (DPRAMB). Port 0: ADDR0, DIN0, DOUT0, CE0, WE0. Port 1: ADDR1, DIN1, CE1, WE1. The number of ports is defined by the RAM resource.]

Schedule after an Array Optimization

• With the existing code & defaults
– Port C is a dual port RAM
– Allows 2 reads per clock cycle
• IO behavior impacts performance

• With the C port partitioned into (4) separate ports
– All reads and mults can occur in one cycle
– If the timing allows
• The additions can also occur in the same cycle
• The write can be performed in the same cycle
• Optionally the port reads and writes could be registered

Note: It could have performed 2 reads in the original rolled design, but there was no advantage since the rolled loop forced a single read per cycle

loop: for (i=3;i>=0;i--) {
  if (i==0) {
    acc+=x*c[0];
    shift_reg[0]=x;
  } else {
    shift_reg[i]=shift_reg[i-1];
    acc+=shift_reg[i]*c[i];
  }
}
*y=acc;

Operators

• Operator sizes are defined by the type (*, +, -, /)
– The variable type defines the size of the operator

• AutoESL will try to minimize the number of operators
– By default AutoESL will seek to minimize area after constraints are satisfied

• User can set specific limits & targets for the resources used
– Allocation can be controlled
• An upper limit can be set on the number of operators or cores allocated for the design: this can be used to force sharing
• e.g. limiting the number of multipliers to 1 will force AutoESL to share
– Resources can be specified
• The cores used to implement each operator can be specified
• e.g. implement each multiplier using a 2-stage pipelined core (hardware)

[Figure: two schedules of the four multiplications (3, 2, 1, 0). First: use 1 mult, but take 4 cycles even if it could be done in 1 cycle using 4 mults. Second: the same 4 mult operations could be done with 2 pipelined mults (with allocation limiting the mults to 2).]

AutoESL Terminology (Clock Cycles)

[Figure: timeline from Input Data through Internal Data (Loop 1/4, Loop 2/4, Loop 3/4, Loop 4/4) to Output Data Y0, annotated with Latency = 14, Throughput = 14, Loop Latency = 12 and Tripcount = 4]

Latency: The number of cycles from input to output (final output of an array write). In this example: 14 cycles.

Throughput: The number of cycles between new input samples (in this example it must wait for all operations to complete before it can read a new input). In this example: 14 cycles.

Data Rate: (1/throughput) * clock frequency. With a 10 ns clock (100 MHz): 100 MHz / 14 = 7.14 MHz.

Initiation Interval (II): The number of cycles between new inputs to a pipeline (the same as throughput, but this term is used with pipelines). Not shown in this example.

Trip count: The number of iterations in a loop. In this example: 4.

Loop Latency: The latency of the entire loop (divide by the trip count to get the latency for each loop iteration). In this example: 12 cycles.

You may have your own terminology: this is AutoESL’s
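The data rate figure on this slide can be reproduced with a few lines of C. The helper name data_rate_hz is hypothetical; it simply divides the clock frequency by the number of cycles between new inputs.

```c
/* Hypothetical helper: input samples accepted per second, given the
 * clock period in nanoseconds and the throughput (or II) in cycles. */
double data_rate_hz(double clock_period_ns, int interval_cycles)
{
    double clock_hz = 1.0e9 / clock_period_ns;   /* 10 ns -> 100 MHz */
    return clock_hz / (double)interval_cycles;
}
```

For the example above, data_rate_hz(10.0, 14) gives roughly 7.14 million samples per second, matching the 7.14 MHz quoted on the slide.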

• Vivado HLS has a number of ways to improve performance
– Automatic (and default) optimizations
– Latency directives
– Pipelining to allow concurrent operations

• Vivado HLS supports techniques to remove performance bottlenecks
– Manipulating loops
– Partitioning and reshaping arrays

• Optimizations are performed using directives
– Let’s look first at how to apply and use directives in Vivado HLS

Improving Performance

• Directives can be placed in the directives file
– The Tcl command is written into directives.tcl
– There is a directives.tcl file in each solution
• Each solution can have different directives

• Directives can be placed into the C source
– Pragmas are added (and will remain) in the C source file
– Pragmas (#pragma) will be used by every solution which uses the code

Optimization Directives: Tcl or Pragma

Once applied, the directive will be shown in the Directives tab (right-click to modify or delete)

• Select the New Solution button
• Optionally modify any of the settings
– Part, Clock Period, Uncertainty
– Solution Name

• Copy existing directives
– By default selected
– Uncheck if you do not want to copy
– No need to copy pragmas, they are in the code

• Copy any existing custom commands into the new script.tcl
– By default selected
– Uncheck if you do not want to copy

Copying Directives into New Solutions

Different Solutions (Directives)

Functions & RTL Hierarchy

• Each function is translated into an RTL block
– Verilog module, VHDL entity
– By default, each function is implemented using a common instance
– Functions may be inlined to dissolve their hierarchy
• Small functions may be automatically inlined

Source code (my_code.c):

void A() { ..body A.. }
void B() { ..body B.. }
void C() {
  B();
}
void D() {
  …
}
void foo_top() {
  A(…);
  C(…);
  D(…);
}

[Figure: RTL hierarchy of foo_top]

Each function/block can be shared like any other component (add, sub, etc.) provided it’s not in use at the same time

• Inlining can be used to remove function hierarchy

void sumsub_func (int *in1, int *in2, int *outSum, int *outSub) {
  *outSum = *in1 + *in2;
  *outSub = *in1 - *in2;
}

void shift_func (int *in1, int *in2, int *outA, int *outB) {
  *outA = *in1 >> 1;
  *outB = *in2 >> 2;
}

void add_sub_pass(int A, int B, int *C, int *D) {
  int apb, amb;
  int a2, b2;
  sumsub_func(&A,&B,&apb,&amb);      // apb = A+B, amb = A-B
  sumsub_func(&apb,&amb,&a2,&b2);    // a2 = 2A, b2 = 2B
  shift_func(&a2,&b2,C,D);           // C = 2A >> 1, D = 2B >> 2
}

[Figure: add_sub_pass with and without inlining. No inlining: the hierarchy of two sumsub_func blocks and one shift_func block (>>1, >>2) is kept, using 2 adders and 2 subtractors (A+B, A-B, then 2A and 2B). Inlining: the outputs reduce to A and B>>1, zero area.]

Inlining allows optimization to be performed across function hierarchies

Like RTL ungrouping, too much inlining can create a lot of logic and slow runtime

Function Inlining
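The “zero area” result can be checked with a small C sketch. The helper functions are declared void here (they return nothing), add_sub_pass_inlined is a hypothetical name for the simplified form, and the equivalence is assumed only for non-negative inputs whose sums do not overflow int.

```c
void sumsub_func(int *in1, int *in2, int *outSum, int *outSub)
{
    *outSum = *in1 + *in2;
    *outSub = *in1 - *in2;
}

void shift_func(int *in1, int *in2, int *outA, int *outB)
{
    *outA = *in1 >> 1;
    *outB = *in2 >> 2;
}

/* Hierarchical version, as on the slide. */
void add_sub_pass(int A, int B, int *C, int *D)
{
    int apb, amb;
    int a2, b2;
    sumsub_func(&A, &B, &apb, &amb);      /* apb = A+B, amb = A-B */
    sumsub_func(&apb, &amb, &a2, &b2);    /* a2 = 2A,  b2 = 2B    */
    shift_func(&a2, &b2, C, D);           /* C = 2A>>1, D = 2B>>2 */
}

/* What remains once optimization can see across the function
 * boundaries: (2A)>>1 is A and (2B)>>2 is B>>1, which in hardware is
 * just wiring, so no adders or subtractors are needed. */
void add_sub_pass_inlined(int A, int B, int *C, int *D)
{
    *C = A;
    *D = B >> 1;
}
```

Without inlining, each call is optimized in isolation and the two adders and two subtractors stay in the design.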

• Vivado HLS performs some inlining automatically
– This is performed on small logic functions if Vivado HLS determines area or performance will benefit

• User Control
– Functions can be specifically inlined
• The function itself is inlined
– Optionally recursively down the hierarchy
– Optionally everything within a region can be inlined
• A region can be a named region, a function or a loop
– Optionally inlining can be explicitly prevented
• Turn inlining off

• Inlining functions allows for greater optimization
– Like ungrouping RTL hierarchies: optimization across boundaries
– Like ungrouping RTL hierarchies, it can result in lots of operations & impact run time

Controlling Inlining

• Design Latency
– The latency of the design is the number of cycles it takes to output the result
• In this example the latency is 10 cycles

• Design Throughput
– The throughput of the design is the number of cycles between new inputs
• By default (no concurrency) this is the same as latency
• Next start/read is when this transaction ends

• In the absence of any concurrency
– Latency is the same as throughput

Latency and Throughput – The Performance Factors

• Given a design with multiple functions
– The code and dataflow are as shown

• Vivado HLS will schedule the design

• It can also automatically optimize the dataflow for throughput

Improving Throughput

• Dataflow Optimization
– Can be used at the top-level function
– Allows blocks of code to operate concurrently
• The blocks can be functions or loops
• Dataflow allows loops to operate concurrently
– It places channels between the blocks to maintain the data rate
• For arrays the channels will include memory elements to buffer the samples
• For scalars the channel is a register with hand-shakes

• Dataflow optimization therefore has an area overhead
– Additional memory blocks are added to the design

Dataflow Optimization

• Dataflow is set using a directive
– Vivado HLS will seek to create the highest performance design
• Throughput of 1

Dataflow Optimization Commands

• Dataflow Optimization
– Dataflow optimization is “coarse grain” pipelining at the function and loop level
– Increases concurrency between functions and loops
– Only works on functions or loops at the top-level of the hierarchy
• Cannot be used in sub-functions

• Function & Loop Pipelining
– “Fine grain” pipelining at the level of the operators (*, +, >>, etc.)
– Allows the operations inside the function or loop to operate in parallel
– Unrolls all sub-loops inside the function or loop being pipelined
• Loops with variable bounds cannot be unrolled: this can prevent pipelining
• Unrolling loops increases the number of operations and can increase memory and run time

Pipelining: Dataflow, Functions & Loops

Dataflow versus Pipelining

Function Pipelining

void foo(...) {
  op_Read;
  op_Compute;
  op_Write;
}

Without Pipelining (schedule: RD, CMP, WR, then RD, CMP, WR again)
• There are 3 clock cycles before operation RD can occur again
– Throughput = 3 cycles
• There are 3 cycles before the 1st output is written
– Latency = 3 cycles

With Pipelining (schedule: a new RD starts every cycle while the previous CMP and WR are still in flight)
• The latency is the same
– Latency = 3 cycles
• The throughput is better
– Throughput = 1 cycle
– Fewer cycles between inputs, higher throughput
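The difference between the two schedules can be quantified with a short C sketch. The helper name total_cycles is hypothetical, and the formula assumes a steady stream of inputs with a fixed latency and a fixed interval between starts.

```c
/* Hypothetical helper: cycles needed to complete n transactions when
 * each one takes `latency` cycles and a new one can start every `ii`
 * cycles. The first result appears after `latency` cycles and each
 * further result follows `ii` cycles later. */
long total_cycles(long n, int latency, int ii)
{
    if (n <= 0) {
        return 0;
    }
    return (long)latency + (n - 1) * (long)ii;
}
```

For 10 calls of foo(), the unpipelined schedule (latency 3, interval 3) needs 30 cycles, while the pipelined schedule (latency 3, II 1) needs 12.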

• The pipeline directive pipelines functions or loops
– This example pipelines the function with an Initiation Interval (II) of 2
• The II is the same as the throughput but this term is used exclusively with pipelines

• Omit the target II and Vivado HLS will automatically pipeline for the fastest possible design
– Specifying a more accurate maximum may allow more sharing (smaller area)

Pipelining Commands

[Figure: pipelined RD, CMP, WR schedule with the Initiation Interval (II) marked]

• Vivado HLS will attempt to unroll all loops nested below a PIPELINE directive
– May not succeed for various reasons and/or may lead to unacceptable area
• Loops with variable bounds cannot be unrolled
• Unrolling multi-level loop nests may create a lot of hardware
– Pipelining the inner-most loop will result in best performance for area
• Or next one (or two) out if inner-most is modest and fixed
• e.g. Convolution algorithm
• Outer loops will keep the inner pipeline fed

Pipelining and Function/Loop Hierarchy

Pipeline at the function level: unrolls L1 and L2 (N*M adders, 3(N*M) accesses)

void foo(in1[ ][ ], in2[ ][ ], …) {
#pragma AP PIPELINE
  …
  L1:for(i=1;i<N;i++) {
    L2:for(j=0;j<M;j++) {
      out[i][j] = in1[i][j] + in2[i][j];
    }
  }
}

Pipeline L1: unrolls L2 (M adders, 3M accesses)

void foo(in1[ ][ ], in2[ ][ ], …) {
  …
  L1:for(i=1;i<N;i++) {
#pragma AP PIPELINE
    L2:for(j=0;j<M;j++) {
      out[i][j] = in1[i][j] + in2[i][j];
    }
  }
}

Pipeline L2: no unrolling (1 adder, 3 accesses)

void foo(in1[ ][ ], in2[ ][ ], …) {
  …
  L1:for(i=1;i<N;i++) {
    L2:for(j=0;j<M;j++) {
#pragma AP PIPELINE
      out[i][j] = in1[i][j] + in2[i][j];
    }
  }
}

Select loop “Add” in the directives pane and right-click

Unrolled loops allow greater options & exploration

Unrolled loops are likely to result in more hardware resources and higher area

Unrolling Loops

• Vivado HLS can automatically flatten nested loops
– A faster approach than manually changing the code

• Flattening should be specified on the inner-most loop
– It will be flattened into the loop above
– The “off” option can prevent loops in the hierarchy from being flattened

Loop Flattening

void foo_top (…) {
  ...
  L1: for (i=3;i>=0;i--) {
    [loop body l1 ]
  }

  L2: for (i=3;i>=0;i--) {
    L3: for (j=3;j>=0;j--) {
      [loop body l3 ]
    }
  }

  L4: for (i=3;i>=0;i--) {
    [loop body l4 ]
  }
}

Before flattening (36 transitions):

  L2: for (i=3;i>=0;i--) {
    L3: for (j=3;j>=0;j--) {
      [loop body l3 ]
    }
  }

After flattening (28 transitions):

  L2: for (k=15;k>=0;k--) {
    [loop body l3 ]
  }

Loops will be flattened by default: use “off” to disable
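A hand-written version of the flattening transformation is sketched below. The function names and the summing loop body are assumptions standing in for “[loop body l3 ]”; the point is that a single counter k replaces the pair (i, j), which are recovered by division and remainder.

```c
#define ROWS 4
#define COLS 4

/* Nested form: entering and leaving the inner loop costs extra state
 * transitions on every outer iteration. */
int sum_nested(int m[ROWS][COLS])
{
    int i, j;
    int acc = 0;
    for (i = ROWS - 1; i >= 0; i--) {
        for (j = COLS - 1; j >= 0; j--) {
            acc += m[i][j];
        }
    }
    return acc;
}

/* Flattened form: one loop of ROWS*COLS iterations visiting the same
 * elements in the same order. */
int sum_flattened(int m[ROWS][COLS])
{
    int k;
    int acc = 0;
    for (k = ROWS * COLS - 1; k >= 0; k--) {
        int i = k / COLS;   /* recovered row index */
        int j = k % COLS;   /* recovered column index */
        acc += m[i][j];
    }
    return acc;
}
```

Because COLS is a power of two here, the division and remainder reduce to bit selection in hardware.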


C Validation and RTL Verification

• There are two steps to verifying the design
– Pre-synthesis: C Validation
• Validate the algorithm is correct
– Post-synthesis: RTL Verification
• Verify the RTL is correct

• C validation
– A HUGE reason users want to use HLS
• Fast, free verification
– Validate the algorithm is correct before synthesis
• Follow the test bench tips given on the following slides

• RTL Verification
– AutoESL can co-simulate the RTL with the original test bench

[Figure: AutoESL flow. Inputs: the test bench, the C, C++ or SystemC source, and a script with constraints/directives. The C is validated before synthesis; AutoESL generates VHDL, Verilog and SystemC; the RTL is verified and then passed to RTL synthesis.]

C Function Test Bench

• The test bench is the level above the function
– The main() function is above the function to be synthesized

• Good Practices
– The test bench should compare the results with golden data
• Automatically confirms any changes to the C are validated
• Automatically verifies the RTL is correct
– The test bench should return a 0 if the self-checking is correct
• Anything but a 0 (zero) will cause RTL verification to issue a FAIL message
• Function main() should expect an integer return (non-void)

int main () {
  int ret=0;
  …
  ret = system("diff --brief -w output.dat output.golden.dat");
  if (ret != 0) {
    printf("Test failed !!!\n");
    ret=1;
  } else {
    printf("Test passed !\n");
  }
  …
  return ret;
}

• The test bench should be in a separate file
• Or excluded from synthesis
– The macro __SYNTHESIS__ can be used to isolate code which will not be synthesized
• This macro is defined when Vivado HLS parses any code (-D__SYNTHESIS__)

// test.c
#include <stdio.h>

// Design to be synthesized
void test (int d[10]) {
  int acc = 0;
  int i;
  for (i=0;i<10;i++) {
    acc += d[i];
    d[i] = acc;
  }
}

// Test bench: nothing in this ifndef will be read by Vivado HLS (it will be read by gcc)
#ifndef __SYNTHESIS__
int main () {
  int d[10], i;
  for (i=0;i<10;i++) {
    d[i] = i;
  }
  test(d);
  for (i=0;i<10;i++) {
    printf("%d %d\n", i, d[i]);
  }
  return 0;
}
#endif

Test Benches

Determine or Create the top-level function

• Determine the top-level function for synthesis
• If there are multiple functions, they must be merged
– There can only be 1 top-level function for synthesis

Given a case where functions func_A and func_B are to be implemented in FPGA:

// main.c
int main () {
  ...
  func_A(a,b,*i1);
  func_B(c,*i1,*i2);
  func_C(*i2,ret);
  return ret;
}

Re-partition the design to create a new single top-level function inside main():

// func_AB.c
#include "func_AB.h"
func_AB(a,b,c, *i1, *i2) {
  ...
  func_A(a,b,*i1);
  func_B(c,*i1,*i2);
  …
}

// main.c
#include "func_AB.h"
int main (a,b,c,d) {
  ...
  // func_A(a,b,i1);
  // func_B(c,i1,i2);
  func_AB (a,b,c, *i1, *i2);
  func_C(*i2,ret);
  return ret;
}

[Figure: before, main.c calls func_A, func_B and func_C directly; after, main.c calls func_AB (in func_AB.c, wrapping func_A and func_B) and func_C]

Recommendation is to separate test bench and design files*
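The re-partitioning can be sketched as compilable C. The slides leave the function bodies and argument types unspecified, so the int types and the arithmetic inside func_A, func_B and func_C below are placeholders chosen only so the wrapper can be exercised.

```c
/* Placeholder bodies: the real functions are not shown on the slides. */
void func_A(int a, int b, int *i1) { *i1 = a + b; }
void func_B(int c, int i1, int *i2) { *i2 = c * i1; }
void func_C(int i2, int *ret)      { *ret = i2 - 1; }

/* New single top-level function for synthesis (would live in
 * func_AB.c): it wraps the two functions destined for the FPGA, so the
 * tool sees exactly one top-level entry point. */
void func_AB(int a, int b, int c, int *i1, int *i2)
{
    func_A(a, b, i1);
    func_B(c, *i1, i2);
}
```

The test bench then calls func_AB() followed by func_C(), exactly where it used to call func_A(), func_B() and func_C() in sequence, so the software result is unchanged.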


Productivity
– Verification
• Functional
• Architectural
– Abstraction
• Datatypes
• Interface
• Classes
– Automation

Block level specification AND verification significantly reduced

Vivado HLS Benefits

[Figure: flow comparison. RTL flow: RTL (Spec), then RTL (Sim). HLS flow: C (Spec/Sim), then RTL (Sim).]

Portability
– Processors and FPGAs
– Technology migration
– Cost reduction
– Power reduction

Design and IP reuse

Vivado HLS Benefits

Permutability
– Architecture Exploration
• Timing: Parallelization, Pipelining
• Resources: Sharing
– Better QoR

Rapid design exploration delivers QoR rivaling hand-coded RTL

Vivado HLS Benefits

Large Networking Company Video Up-scaler

Total effort for AutoESL design: 2 days
AutoESL runtime: 8 seconds
Slice Registers: 1651
Slice LUTs: 1566
DSP48s: 34
Achieved throughput: 720p -> 1080p at 60 fps, 150 MHz in Virtex 5

Fast design time

Efficient RTL

High-definition video

Comprehensive C Support

• A Complete C Validation & Verification Environment
– AutoESL supports complete bit-accurate validation of the C model
– AutoESL provides a productive C-RTL co-simulation verification solution

• AutoESL supports C, C++ and SystemC
– Functions can be written in any version of C
– Wide support for coding constructs in all three variants of C
• It’s easier to discuss what’s not supported than what is

• Modeling with bit-accuracy
– Supports arbitrary precision types for all input languages
– Allowing the exact bit-widths to be modeled and synthesized

• Floating point support
– Support for the use of float and double in the code

• Pointers and Streaming based applications
– Multi-access pointer issues and streams

• The vast majority of C, C++ and SystemC is supported
– Provided it is statically defined at compile time
– If it’s not defined until run time, it won’t be synthesizable

• Any of the three variants of C can be used
– If C is used, AutoESL expects the file extensions to be .c
– For C++ and SystemC it expects file extensions .cpp

C, C++ and SystemC Support

Summary

• In High-Level Synthesis (HLS)
– C becomes RTL
– Operations in the code map to hardware resources
– Understand how constructs such as functions, loops and arrays are synthesized

• HLS design involves
– Synthesize the initial design
– Analyze to see what limits the performance
• User directives to change the default behaviors
• Remove bottlenecks
– Analyze to see what limits the area
• The types used define the size of operators
• This can have an impact on what operations can fit in a clock cycle

• Use directives to shape the initial design to meet performance
– Increase parallelism to improve performance
– Refine bit sizes and sharing to reduce area

Celoxica Handel-C

Handel-C

Programming language
- enables compilation of programs into synchronous hardware

NOT a Hardware Description Language
- it’s a programming language aimed at compiling high-level algorithms into gate-level hardware

Syntax (loosely) based on “C”

Handel-C is to hardware (gates) what “C” is to micro-assembly code

Handel-C: Advantages

Hardware design produced is exactly the hardware specified in source program

Logic gates are assembly instructions of Handel-C system

No intermediate “interpreting” layer as in assembly language targeting general purpose microprocessor

Easy to learn!

Design/re-design/optimize at software level!!!

Comparison with “C”

Similar:
- Programs inherently sequential
- Similar control-flow constructs: if-then-else, switch, while, for, etc.

Dissimilar:
- No malloc / dynamic store allocation
- No recursion (limited recursion in macros)
- No nested procedures
- No stdin/stdout
- “void main()”
- variable width words
- PAR, etc.

Example 1 (sum)

void main() {
    unsigned int 16 sum;    // variable width word
    unsigned int 8 data;
    chanin input;           // input/output
    chanout output;

    sum = 0;
    do {
        input ? data;
        sum = sum + (0 @ data);   // pad the 8-bit data with zeros to match the 16-bit sum
    } while (data != 0);
    output ! sum;
}

IMPORTANT – width!!
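Why the width matters can be seen in an ordinary C model of Example 1. This is not Handel-C: the channel reads are replaced by an input array, and the function name sum_until_zero is hypothetical. The 8-bit samples are widened to 16 bits before the addition, which is the role played by 0@data.

```c
#include <stddef.h>
#include <stdint.h>

/* C model of the Handel-C sum example: accumulate 8-bit samples into a
 * 16-bit sum, stopping after the first zero sample (do ... while). */
uint16_t sum_until_zero(const uint8_t *samples, size_t count)
{
    uint16_t sum = 0;
    size_t i;
    for (i = 0; i < count; i++) {
        uint8_t data = samples[i];
        /* Widen to 16 bits, then add; the result wraps modulo 2^16. */
        sum = (uint16_t)(sum + (uint16_t)data);
        if (data == 0) {
            break;
        }
    }
    return sum;
}
```

With samples 200 and 100 the 16-bit sum is 300, a value an 8-bit accumulator could not hold.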

Supported Declarations & Statements

Main program structure
Comments: /* */ and //
Variables
Constants
Arrays
Structures
Conditional Execution
    If statement
    Switch statement
Arithmetic, relational and logical operators
Iteration
    For loop
    While loop
    Do … While loop

Handel-C describes Hardware!

No side effects in expressions
    i.e. statements like a = b*c++; are not supported

No floating point
    Floating point is not directly supported by Handel-C, but DK4/5 includes a library for fixed and floating point arithmetic

No run-time recursion
    Due to the absence of any kind of ‘call stack’ in hardware.

Limited standard library (i.e. no printf, fopen etc.)
    However, DK allows direct calls to external functions written in C/C++, and these could incorporate file I/O, user interaction, recursion, etc.

Declarations

Handel-C uses two kinds of objects:
1. Logic types
2. Architecture types

Logic types specify variables
    The basic logic type is int

Architecture types specify variables that require a particular sort of hardware architecture
    ROMs
    RAMs
    Channels (I/O Simulation)
    Interfaces (Connect to Board, i.e., busses)

Variables

The range of an 8-bit signed integer is -128 to 127
    Signed integers use 2’s complement representation
The range of an 8-bit unsigned integer is 0 to 255 inclusive.
Predetermined widths available
    char (8), short (16), long (32), int32 (32), int64 (64)
    Handel-C provides support for porting from conventional C by allowing the types char, short and long
Examples:

unsigned char w;   // 8 bits
short y;           // 16 bits (signed)
unsigned long z;   // 32 bits

Variables

Handel-C has one basic type: integer
    May be signed or unsigned
    Can be any width, not limited to 8, 16, 32 etc.

Variables are mapped to hardware registers.

void main(void) {
    unsigned 6 a;
    a = 45;
}

a = 1 0 1 1 0 1 = 0x2d (MSB on the left, LSB on the right)
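The effect of a declared width can be modelled in plain C by masking. This is a sketch under the assumption that a result is kept to the declared width with the overflow bits dropped; the helper name wrap_unsigned is hypothetical and is not part of Handel-C.

```c
/* Hypothetical helper: keep only the low `width` bits of value, the way
 * an unsigned `width`-bit register would. Valid for widths from 1 up to
 * one less than the number of bits in unsigned long. */
unsigned long wrap_unsigned(unsigned long value, int width)
{
    unsigned long mask = (1UL << width) - 1UL;
    return value & mask;
}
```

45 fits in 6 bits (the largest 6-bit value is 63), so it is stored unchanged as 0x2d, whereas 64 would wrap to 0.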

Features & Statements (contd.)

Variables

/* Compiler will determine suitable width of vars */
int 10 x, y, z;
int undefined a;
a = x + y;

Arrays (declarations same as conventional C)
    Index must be compile-time constant
    Access in parallel of array variables is allowed
    Implemented as a sequence of registers (expensive)

int 6 x[7];
x[4] = 1;
unsigned int 6 x[4][5][6];

A Simple Program

Assignments Hardware

Handel-C Timing

Sequential Execution

Handel-C: Parallelism
Handel-C blocks are sequential by default
par{…} executes statements in parallel
A par block completes when all of its statements complete
Time for the block is the time for the longest statement
Sequential blocks can be nested in par blocks
Sequential Block
// 3 Clock Cycles
{
a=1;
b=2;
c=3;
}
Parallel Block
// 1 Clock Cycle
par{
a=1;
b=2;
c=3;
}
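The timing rules above (sequential blocks add up, a par block takes as long as its longest branch) can be sketched as a small cycle counter. This is a Python model under the assumption that every assignment takes exactly one clock cycle; the tuple encoding and the name `cycles` are made up for illustration.

```python
def cycles(node):
    """Clock cycles for a nested ('seq' | 'par', [children]) tree.

    A plain string stands for one assignment and costs one cycle.
    """
    if isinstance(node, str):
        return 1
    kind, children = node
    child_cycles = [cycles(child) for child in children]
    if kind == "seq":
        return sum(child_cycles)             # statements run one after another
    if kind == "par":
        return max(child_cycles, default=0)  # block ends with its longest branch
    raise ValueError("unknown block kind: " + kind)


sequential = ("seq", ["a=1", "b=2", "c=3"])
parallel = ("par", ["a=1", "b=2", "c=3"])
nested = ("par", [("seq", ["a=1", "b=2"]), "c=3"])

print(cycles(sequential), cycles(parallel), cycles(nested))
```

The nested case shows why a par block with unequal branches is only as fast as its slowest branch: the single assignment finishes after one cycle but the block still takes two.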

Timing

Additional Features & Statements
Concurrency
par{
{ … }
…
{ … }
}

Par Completion: Soln

More Parallelism
Example: array initialisation
Sequential version takes 20 clock cycles
Each of the 10 iterations takes 2 cycles: 1 for the assignment plus 1 cycle of for() loop overhead for the increment
Parallel version takes 1 clock cycle
Replicated par() builds hardware to execute all 10 iterations in a single cycle
Allows trade-off between hardware size and performance
Sequential code
for(i=0;i<10;i++){ array[i]=0;}
Parallel code
par(i=0;i<10;i++){ array[i]=0;}
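The cycle counts for the two versions can be reproduced with simple arithmetic. A minimal Python sketch, assuming one cycle per assignment and one cycle of loop overhead per iteration as stated on the slide; the function names are hypothetical.

```python
def sequential_loop_cycles(iterations, body_cycles=1, increment_cycles=1):
    """Cycles for a for() loop: each iteration pays for the body and the increment."""
    return iterations * (body_cycles + increment_cycles)


def replicated_par_cycles(iterations, body_cycles=1):
    """Cycles for a replicated par(): all copies of the body run at the same time."""
    return body_cycles if iterations > 0 else 0


def replicated_par_copies(iterations):
    """Copies of the body hardware built by a replicated par()."""
    return iterations


print(sequential_loop_cycles(10))  # one shared body, many cycles
print(replicated_par_cycles(10))   # many bodies, one cycle
print(replicated_par_copies(10))
```

The trade-off is visible in the last two values: the parallel version saves cycles by paying for one copy of the body per iteration.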

While Loops

Example: Conditional Operators

Arrays, RAMs and ROMs
Handel-C allows designers to declare arrays of registers, ROMs and RAMs.
An array of registers is declared like an array in C. All the registers may be accessed in parallel.
This array can be turned into a ROM or RAM by putting the appropriate keyword in front. Only one location may then be accessed per clock cycle.
unsigned 8 Data[256];
ram unsigned 8 Data1[256];
rom unsigned 8 Data2[256];
// Array & RAM access example
{
A = Data2[1]; // Read array, RAM or ROM
Data1[11] = 3; // Write to array or RAM
}
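The one-access-per-clock-cycle restriction on RAMs can be illustrated with a small behavioural model. This is a Python sketch, not Handel-C and not how the compiler enforces the rule: the class `SinglePortRam` and its `tick` method are assumptions made for illustration only.

```python
class SinglePortRam:
    """Behavioural model of a RAM that allows one access per clock cycle."""

    def __init__(self, entries, width):
        self.mask = (1 << width) - 1
        self.cells = [0] * entries
        self.accessed = False

    def tick(self):
        """Advance to the next clock cycle and free the port."""
        self.accessed = False

    def _claim_port(self):
        if self.accessed:
            raise RuntimeError("second RAM access in the same clock cycle")
        self.accessed = True

    def read(self, addr):
        self._claim_port()
        return self.cells[addr]

    def write(self, addr, value):
        self._claim_port()
        self.cells[addr] = value & self.mask


ram = SinglePortRam(entries=256, width=8)
ram.write(11, 3)      # cycle 1: one write
ram.tick()
value = ram.read(11)  # cycle 2: one read
try:
    ram.read(12)      # same cycle: the port is already in use
    conflict = False
except RuntimeError:
    conflict = True

print(value, conflict)
```

An array of registers has no such port, which is why all of its elements can be accessed in the same cycle at the cost of more hardware.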

Additional Features & Statements
Using external and internal RAM / ROM
RAMs and ROMs may only have one entry accessed in any clock cycle.
More efficient to implement in terms of hardware resources than arrays, and they allow a non-constant index.
The Handel-C compiler can infer width, type and number of entries.

RAM Access from Handel-C
Handel-C allows you access to a number of different types of RAM:
1) Distributed RAM, which is implemented in look-up tables in the logic blocks of FPGAs.
2) Block RAM, which is available on certain chips.
3) Off-chip RAM

(1) Distributed RAM
Internal RAM / ROM
ram unsigned int 8 myram[256];
rom unsigned int 8 program[] = {1,2,3,4};
unsigned char i;
i = 3;
myram[i] = 25;
for (i = 0; i < 4; i++)
    stdout ! program[i];

(2) Block RAM
Block RAM (Single Port)
ram unsigned 8 MyRam[512] with {block = 1};
Block RAM (Dual Port)
mpram
{
ram unsigned 8 ReadWriteA[512];
ram unsigned 8 ReadWriteB[512];
} MyRam with {block = 1};

RAM Access: Use Registers
To minimize the logic for external, distributed and block RAM accesses, it is best to use registers directly for address and data.
Write: supply the data and address from a register directly (no expression)
MyRam[MyAddressReg] = MyDataReg;
Read: supply the address directly from a register and read the data directly into a register
MyDataReg = MyRam[MyAddressReg];

(3) Off-chip RAM
External RAM / ROM
ram unsigned int 4 ExtRAM[8] with {offchip = 1,
    data = {"P01", "P02", "P03", "P04"},
    addr = {"P05", "P06", "P07"},
    we = {"P08"}, oe = {"P09"}, cs = {"P10"} };
rom unsigned int 4 ExtROM[8] with {offchip = 1,
    data = {"P01", "P02", "P03", "P04"},
    addr = {"P05", "P06", "P07"},
    we = {}, oe = {"P09"}, cs = {"P10"} };
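The number of pins in the declarations above follows from the memory's shape: a 4-bit word needs 4 data pins and 8 entries need 3 address pins. A minimal Python sketch of that arithmetic; the helper names are hypothetical, and the count of three control pins (we, oe, cs) is taken from the ExtRAM example.

```python
def address_pins(entries):
    """Address lines needed to select one of `entries` locations."""
    if entries < 1:
        raise ValueError("a memory needs at least one entry")
    return (entries - 1).bit_length()  # ceil(log2(entries)) without floating point


def data_pins(width):
    """Data lines needed for a word of `width` bits."""
    return width


# ExtRAM from the slide: 8 entries of 4 bits, plus we, oe and cs.
control_pins = 3
total_pins = data_pins(4) + address_pins(8) + control_pins

print(data_pins(4), address_pins(8), total_pins)
```

The total matches the ten pin names P01 to P10 used in the ExtRAM declaration.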

Synthesizable ANSI-C for hardware

Porting “C” to Handel-C
1. Decide how software maps to hardware platform
2. Partition algorithm between multiple FPGAs
3. Port C to Handel-C & use simulator to check correctness
4. Modify code to take advantage of extra operators in Handel-C, then simulate to ensure correctness
5. Add fine-grain parallelism through PAR & parallel assignments or parallelize algorithm, then simulate
6. Add hardware interfaces for target architecture & map simulator channel communications onto these interfaces, then simulate
7. Use FPGA place & route tools to generate FPGA images

Summary: Handel-C
C-based programming language for digital system design.
One clock cycle per statement.
Explicit parallelism.
Compiler generates hardware design from Handel-C source.
Additions:
support for parallelism (PAR statement)
channels for communications between parallel processes
operators for detailed control of hardware
constructs for RAM, ROM, interfacing, etc.
