Post on 25-Dec-2015
ENG6530 Reconfigurable
Computing Systems
High Level Languages
“Electronic System Level (ESL) Design”
ENG6530 RCS 2
Topics
• Issues with Reconfigurable Computing
  – Complexity of Applications …
  – Complexity of the Design Cycle
• Electronic System Level (ESL)
  – Motivation, Why?
  – Advantages/Disadvantages
• Summary
References
• “Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation”, S. Hauck and A. DeHon, 2008.
• “Leading Languages: Is There a Future Beyond RTL”, FPGA Journal, 2005.
• “The Challenges of Synthesizing Hardware from C-like Languages”, Stephen Edwards.
• “Design of a high-level language for Custom Computing Machines”, C. Van Reeuwijk, 2002.
• “Comparison of VHDL, Verilog and SystemVerilog”, Stephen Bailey, Model Technology.
• http://www.SystemC.org (System-C)
• http://www.celoxica.com (Handel-C)
• http://www.mentor.com (Catapult-C)
• http://www.xilinx.com (AutoESL)
Key Markets for HPC
How are we going to manage Design Complexity?
Managing Complexity
• One important practical approach to handling complexity is to raise the level of abstraction
• We can take guidance from previous shifts in methodology which raised the level of abstraction
  – from schematics to HDLs
  – from assembler code to HLLs
Complexity of Design Cycle
Why do companies face these problems?
1. You can’t get your hardware designs done quickly enough: designs are getting too complex to handle (SoC)
2. You don’t have enough experienced hardware designers
3. Errors in design or unimplemented features cost $
4. ASICs and development tools are costly $
5. Software development stalls waiting for the hardware
A Need for a New Design Language
Verilog and VHDL work very well for HW implementation flows, but …
1. They are too complicated for casual use.
2. Systems are becoming more complex, pushing us to design and verify at higher levels of abstraction.
3. Designers often implement today’s systems as a mix of hardware and software (which parts should be HW and which SW?)
   – It is essential that new design flows support early software development, integration with existing C/C++ code, and HW/SW co-design.
   – Using a single language like C simplifies the migration task!
4. If we synthesize hardware from C-like languages, we can effectively turn every C programmer into a hardware designer!
High-level Synthesis
• Wouldn’t it be nice to write high-level code?
  – Ratio of C to VHDL developers (10000:1 ?)
  + Easier to specify
  + Separates function from architecture
  + More portable
  - Hardware potentially slower
• Similar to assembly code era
  – Programmers could always beat the compiler
  – But that is no longer the case
• Hopefully, high-level synthesis will catch up to manual effort
Abstraction: Advantages
Why not a Software Language for Design Entry?
• The semantics of “C” and similar languages are distant from hardware (execution models!)
  – Software follows a sequential model
  – Hardware is fundamentally concurrent
• The C language has no support for user-specified parallelism
  – So either the synthesis tool must find it (a difficult task)
  – Or the designer must use language extensions and insert explicit parallelism (the programmer will have to think differently to design hardware)
• Techniques for synthesizing hardware from C either generate inefficient hardware or propose a language that merely adopts parts of C syntax.
Advantages of HLLs for Hardware Design
1. Designs are often specified by a C/C++ executable
   – Some problems are better expressed as a software algorithm
   – Software reference designs can be utilized
2. Enables much higher speed verification
   – Faster simulation at architecture level than at gate level
   – Reduces risk by enabling early verification of the entire system
3. Software development techniques can be used
4. Simplifies hardware-software partitioning
5. Brings hardware and software teams closer together
Requirements for New Language?
• Don’t invent a new language! Build on C/C++ so that:
  – Extensive C/C++ infrastructure (compilers, debuggers, language standards, books, etc.) can be re-used.
  – Users’ existing knowledge of C/C++ can be leveraged.
  – Integration with existing C/C++ code is easy.
• It must support specification and refinement to detailed implementation of both software and hardware.
• It must support verification through all stages of the design process.
• It must provide a very general set of modeling constructs to cleanly support the wide range of abstraction levels and models of computation used in system design.
Semiconductor Design
[Figure: evolution of semiconductor design methodology]
Era      Methodology          Front end                   Back end
1970’s   Hand crafted         In house                    Cut rubies (manual)
1980’s   Schematic capture    Daisy, Mentor, Valid        Calma, internal
1990’s   VHDL / Verilog       Synopsys, Cadence, Mentor   Dracula, Cadence, Avant!
2000’s   System-level design  Handel-C, SystemC, SystemVerilog, CatapultC, ImpulseC, AutoESL
Ease of Use vs. Efficiency
[Figure: languages plotted by ease of use (easy to difficult) against efficiency (low to high): Verilog, VHDL, SystemC, Handel-C, SystemVerilog, CatapultC, ImpulseC, Vivado HLS]
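Contrasting ESLs
[Figure: ESL tools placed on a hardware-to-software axis: HDL, Handel-C, VIVA, Mitrion C, Impulse-C, SystemC, Vivado HLS]
• Hardware-oriented tools: explicit par statements, memory statements, channels, …
• Software-oriented tools: pure C/C++ statements with pragmas inserted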
C to FPGA Accelerated System
[Figure: C-based co-design flow. Blocks shown: algorithm design, specification model, testbench design, system model, HW/SW partitioning, software model, mixed simulation, C for HW, C/C++, APIs/libraries, C-based synthesis, architecture exploration, design analysis and optimization, function & architecture, implementation, RTL, synthesis, P&R, EDIF, FPGA, OBJ, processor, BSP, COMMS]
Commercial RC Applications
Well established in embedded systems … using C-based design:
• Digital Video Technology and Image Processing
  – “Processing at the sensor” versus local and/or remote processing
  – 3D LCD display development and test
  – Real-time verification of HDTV image processing algorithms
  – Robust image matching: product tracking and production line control
• Digital Signal Processing
  – Engine control unit for 3-phase motors
  – Radar and sonar beam forming and spatial filtering
  – Computer aided tomography security system
• Communications and Networking
  – Internet reconfigurable multimedia terminal, MP3, VoIP, etc.
  – Ground traffic simulation test bed for broadband satellite network communications
  – Satellite-based Internet data tracking system
• Rapid Systems Prototyping
  – Automotive safety system incorporating sensor fusion
  – Robotic vision system for object detection and robot guidance
Markets: Defense & Security, Consumer, Automotive & Industrial
Summary
• Systems are too complicated today to rely only on Hardware Description Languages such as VHDL or Verilog.
• New languages have emerged, such as SystemC, Handel-C, CatapultC, ImpulseC, …
• Some of these languages are
  – Suitable for system verification (speed up the simulation of the system)
  – Suitable for synthesis
  – Suitable for architecture exploration
  – Suitable for Hardware/Software Co-design
• Challenges:
  – Efficiency of synthesizers (performance, area, power)
  – Learning curve
ENG6530 Reconfigurable
Computing Systems
High Level Synthesis
CAD for FPGAs: Synthesis
[Figure: FPGA CAD flow with the stages design entry, synthesis, logic optimization, mapping to k-LUT, packing LUTs to CLBs, placement, routing and FPGA configuration, with simulation alongside]
FPGA Tool Flow with ESL
[Figure: C/C++, Java, etc. enter high-level synthesis, which emits HDL; RT synthesis produces a netlist; physical design (technology mapping, placement, routing) produces the bitfile for the FPGA; the software side targets the processor]
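High Level Synthesis
[Figure: inputs and outputs of high-level synthesis]
• Algorithm:
    WHILE G < K LOOP
      F := E*(A+B);
      G := (A+B)*(C+D);
    END LOOP;
• Constraints: area; time (clock period, number of clock steps); power
• Library: +, -, *, <
• Result: a datapath (operators +, *, < over A, B, C, D, E, K, producing F and G) and a controller (PLA and latches)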
High-level Synthesis
• First, consider how to manually convert high-level code into a circuit
• Steps
  – 1) Build FSM for controller
  – 2) Build datapath based on FSM

    acc = 0;
    for (i=0; i < 128; i++)
      acc += a[i];
Manual Example
• Build a FSM (controller)
  – Decompose code into states
[Figure: state diagram with the states “acc=0, i=0”, “if (i < 128)”, “load a[i]”, “acc += a[i]”, “i++” and “Done”; after “i++” control returns to the test, and “Done” is reached when the test fails]
Manual Example
• Build a datapath
  – Allocate resources for each state
[Figure: registers acc, i, addr and a[i]; a comparator (<) against the constant 128; two incrementers (+1); one adder for acc]
Manual Example
• Build a datapath
  – Determine register inputs
[Figure: 2x1 multiplexers added in front of the registers, selecting between the initial values (0, 0, &a) and the updated values; a[i] is loaded from memory]
Manual Example
• Build a datapath
  – Add outputs
[Figure: the same datapath with two outputs added: memory address and acc]
Manual Example
• Build a datapath
  – Add control signals
[Figure: the same datapath annotated with control signals]
Manual Example
• Combine controller + datapath
[Figure: the controller drives the datapath; outputs are memory address, acc, Done and Memory Read]

    acc = 0;
    for (i=0; i < 128; i++)
      acc += a[i];
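The controller and datapath built by hand above can be mimicked in software. The following is only a sketch: the state names and the function `accumulate_fsm` are hypothetical, and each pass through the `while` loop stands for one state transition of the controller.

```c
#include <stddef.h>
#include <stdint.h>

/* Controller states, one per box in the state diagram (names are illustrative). */
typedef enum { S_INIT, S_CHECK, S_LOAD, S_ACC, S_INC, S_DONE } state_t;

/* Software model of controller + datapath for:
 *   acc = 0; for (i = 0; i < 128; i++) acc += a[i];
 * Returns the sum; *steps (if not NULL) receives the number of
 * state transitions taken before reaching S_DONE. */
int32_t accumulate_fsm(const int32_t a[128], unsigned *steps)
{
    state_t state = S_INIT;
    int32_t acc = 0, data = 0;   /* datapath registers acc and a[i] */
    unsigned i = 0, n = 0;       /* loop counter register, step counter */

    while (state != S_DONE) {
        switch (state) {
        case S_INIT:  acc = 0; i = 0;  state = S_CHECK;                  break;
        case S_CHECK:                  state = (i < 128) ? S_LOAD : S_DONE; break;
        case S_LOAD:  data = a[i];     state = S_ACC;                    break;
        case S_ACC:   acc += data;     state = S_INC;                    break;
        case S_INC:   i++;             state = S_CHECK;                  break;
        case S_DONE:                                                     break;
        }
        n++;
    }
    if (steps != NULL) *steps = n;
    return acc;
}
```

With one state per step this model needs 1 + 4×128 + 1 = 514 transitions, which hints at why scheduling several operations into the same cycle matters.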
Manual Example
• Comparison with high-level synthesis
  – Determining when to perform each operation
    • => Scheduling
  – Allocating resource for each operation
    • => Resource allocation
  – Mapping operations onto resources
    • => Binding
Behavioral Synthesis
[Figure: algorithm, I/O behavior and target library feed behavioral synthesis (resource allocation, scheduling, binding), which produces an RTL design; the classic RTL design flow then runs logic synthesis to obtain a gate-level netlist]
HLS: Main Steps
• High-level code
  – Syntactic analysis (front end): converts code to an intermediate representation, which allows all following steps to use a language-independent format
• Intermediate representation
  – Optimization
  – Scheduling / resource allocation: determines when each operation will execute, and the resources used
  – Binding / resource sharing (back end): maps operations onto physical resources
• Controller + datapath
Intermediate Representation
• Parser converts tokens to intermediate representation
  – Usually, an abstract syntax tree

    x = 0;
    if (y < z)
      x = 1;
    d = 6;

[Figure: abstract syntax tree with three subtrees: assign (x, 0); if (cond: y < z; assign: x, 1); assign (d, 6)]
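One way to picture the abstract syntax tree for this fragment is as a small C data structure. This is a sketch under assumptions: the node kinds, the `node_t` layout and the tiny `eval` walker are hypothetical and only show that the tree carries the same meaning as the source text.

```c
#include <stddef.h>

typedef enum { N_CONST, N_VAR, N_LESS, N_ASSIGN, N_IF, N_SEQ } kind_t;

typedef struct node {
    kind_t kind;
    int value;                  /* N_CONST: literal; N_VAR / N_ASSIGN: variable index */
    const struct node *kid[3];  /* children; unused slots are NULL */
} node_t;

enum { X, Y, Z, D, NVARS };     /* variable table indices */

/* Walk the tree, reading and updating the variable table env[]. */
int eval(const node_t *n, int env[NVARS])
{
    switch (n->kind) {
    case N_CONST:  return n->value;
    case N_VAR:    return env[n->value];
    case N_LESS:   return eval(n->kid[0], env) < eval(n->kid[1], env);
    case N_ASSIGN: return env[n->value] = eval(n->kid[0], env);
    case N_IF:     return eval(n->kid[0], env) ? eval(n->kid[1], env) : 0;
    case N_SEQ:
        for (size_t i = 0; i < 3 && n->kid[i] != NULL; i++)
            eval(n->kid[i], env);
        return 0;
    }
    return 0;
}

/* Tree for: x = 0; if (y < z) x = 1; d = 6; */
static const node_t c0  = { N_CONST, 0, { NULL } };
static const node_t c1  = { N_CONST, 1, { NULL } };
static const node_t c6  = { N_CONST, 6, { NULL } };
static const node_t vy  = { N_VAR, Y, { NULL } };
static const node_t vz  = { N_VAR, Z, { NULL } };
static const node_t lt  = { N_LESS, 0, { &vy, &vz } };
static const node_t ax0 = { N_ASSIGN, X, { &c0 } };
static const node_t ax1 = { N_ASSIGN, X, { &c1 } };
static const node_t ad6 = { N_ASSIGN, D, { &c6 } };
static const node_t ifn = { N_IF, 0, { &lt, &ax1 } };
const node_t example_ast = { N_SEQ, 0, { &ax0, &ifn, &ad6 } };
```

A synthesis back end would analyze and transform such a tree rather than interpret it; the interpreter here is just the shortest way to check the structure.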
Intermediate Representation
• Why use intermediate representation?
  – Easier to analyze/optimize than source code
  – Theoretically can be used for all languages
    • Makes synthesis back end language independent
[Figure: C code, Java and Perl each pass through their own syntactic analysis into a common intermediate representation, which feeds one back end]
• Back end: scheduling, resource allocation and binding are independent of the source language; sometimes optimizations too
Scheduling
• Scheduling assigns a start time to each operation in the data flow graph (DFG)
  – Start times must not violate dependencies in the DFG
  – Start times must meet performance constraints
    • Alternatively, resource constraints
• Performed on the DFG of each control flow graph (CFG) node
  – => Can’t execute multiple CFG nodes in parallel
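Scheduling Examples
[Figure: three schedules for adding a, b, c and d: a chain of three additions in cycles 1, 2 and 3; a tree whose two leaf additions run in cycles 1 and 2 with the final addition in cycle 3; a tree whose two leaf additions both run in cycle 1 with the final addition in cycle 2]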
Scheduling Problems
• Several types of scheduling problems
  – Usually some combination of performance and resource constraints
• Problems:
  1. Unconstrained
     • Not very useful, every schedule that respects the dependencies is valid
  2. Minimum latency
  3. Latency constrained
  4. Minimum-latency, resource constrained
     • i.e. find the schedule with the shortest latency that uses no more than a specified # of resources
     • NP-Complete
  5. Minimum-resource, latency constrained
     • i.e. find the schedule that meets the latency constraint (which may be anything) and uses the minimum # of resources
     • NP-Complete
Minimum Latency Scheduling
• ASAP (as soon as possible) algorithm
  – Find a candidate node
    • Candidate is a node whose predecessors have been scheduled and completed (or that has no predecessors)
  – Schedule node one cycle later than the max cycle of its predecessors
  – Repeat until all nodes scheduled
[Figure: ASAP schedule of a DFG over inputs a to h, with operations +, +, *, *, -, < and + placed in cycles 1 to 4]
Minimum possible latency: 4 cycles
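The ASAP steps above can be sketched in C as follows. The dependency-matrix representation `dfg_t`, the bound `MAX_OPS` and the assumption that every operation takes exactly one cycle are illustrative choices, not something the slides prescribe.

```c
#define MAX_OPS 16

/* dep[v][u] != 0 means operation v consumes the result of operation u. */
typedef struct {
    int n;                                   /* number of operations */
    unsigned char dep[MAX_OPS][MAX_OPS];
} dfg_t;

/* ASAP scheduling with unit-latency operations.
 * Fills cycle[v] with a 1-based start cycle and returns the latency,
 * or -1 if the graph contains a dependency cycle. */
int asap(const dfg_t *g, int cycle[MAX_OPS])
{
    int scheduled = 0, latency = 0;
    for (int v = 0; v < g->n; v++) cycle[v] = 0;      /* 0 = unscheduled */

    while (scheduled < g->n) {
        int progress = 0;
        for (int v = 0; v < g->n; v++) {
            if (cycle[v]) continue;
            int ready = 1, start = 1;
            for (int u = 0; u < g->n; u++) {
                if (!g->dep[v][u]) continue;
                if (!cycle[u]) { ready = 0; break; }  /* predecessor pending */
                if (cycle[u] + 1 > start) start = cycle[u] + 1;
            }
            if (ready) {                              /* candidate node */
                cycle[v] = start;
                if (start > latency) latency = start;
                scheduled++;
                progress = 1;
            }
        }
        if (!progress) return -1;
    }
    return latency;
}
```

For a graph with two independent additions feeding a multiply that feeds a subtract, this places the additions in cycle 1, the multiply in cycle 2 and the subtract in cycle 3.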
Minimum Latency Scheduling
• ALAP (as late as possible) algorithm
  – Run ASAP, get minimum latency L
  – Find a candidate
    • Candidate is a node whose successors are scheduled (or that has none)
  – Schedule node one cycle before the min cycle of its successors
    • Nodes with no successors are scheduled to cycle L
  – Repeat until all nodes scheduled
[Figure: the same DFG scheduled ALAP with L = 4 cycles; operations that are not on the critical path move to later cycles (3 and 4)]
Minimum Latency Scheduling
• ALAP
  – Has to run ASAP first, seems pointless
  – But, many heuristics need the mobility/slack of each operation
    • ASAP gives the earliest possible time for an operation
    • ALAP gives the latest possible time for an operation
  – Slack = difference between earliest and latest possible schedule
    • Slack = 0 implies operation has to be done in the current scheduled cycle
    • The larger the slack, the more options a heuristic has to schedule the operation
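A possible C sketch of ALAP and the slack computation follows. It assumes unit-latency operations and a dependency-matrix graph (`dfg_t`, `MAX_OPS` and the function names are hypothetical); the ASAP cycles and the latency L are taken as inputs rather than recomputed.

```c
#define MAX_OPS 16

/* dep[v][u] != 0 means operation v consumes the result of operation u. */
typedef struct {
    int n;
    unsigned char dep[MAX_OPS][MAX_OPS];
} dfg_t;

/* ALAP scheduling with unit-latency operations for a target latency L.
 * Fills cycle[u] with a 1-based cycle; returns 0 on success, -1 if L is
 * too small or the graph contains a dependency cycle. */
int alap(const dfg_t *g, int L, int cycle[MAX_OPS])
{
    int scheduled = 0;
    for (int u = 0; u < g->n; u++) cycle[u] = 0;      /* 0 = unscheduled */

    while (scheduled < g->n) {
        int progress = 0;
        for (int u = 0; u < g->n; u++) {
            if (cycle[u]) continue;
            int ready = 1, latest = L;                /* no successors: cycle L */
            for (int v = 0; v < g->n; v++) {
                if (!g->dep[v][u]) continue;          /* v is a successor of u */
                if (!cycle[v]) { ready = 0; break; }
                if (cycle[v] - 1 < latest) latest = cycle[v] - 1;
            }
            if (ready) {
                if (latest < 1) return -1;            /* L cannot be met */
                cycle[u] = latest;
                scheduled++;
                progress = 1;
            }
        }
        if (!progress) return -1;
    }
    return 0;
}

/* Slack (mobility) of each operation: latest minus earliest cycle. */
void slack(int n, const int asap_cycle[], const int alap_cycle[], int out[])
{
    for (int v = 0; v < n; v++) out[v] = alap_cycle[v] - asap_cycle[v];
}
```

Operations on the critical path come out with slack 0; an independent operation in a 3-cycle schedule has slack 2, so a heuristic may place it in any of the three cycles.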
Binding
• During scheduling, we determined:
  – When ops will execute
  – How many resources are needed
• We still need to decide which ops execute on which resources
  – => Binding
  – If multiple ops use the same resource
    • => Resource Sharing
Binding
• Basic idea: map operations onto resources such that operations in the same cycle don’t use the same resource
[Figure: eight operations (1 to 8: *, +, +, *, *, +, -, -) scheduled over cycles 1 to 4 and bound to 2 ALUs (+/-) and 2 multipliers: Mult1, ALU1, ALU2, Mult2]
Binding
• Many possibilities
  – Bad binding may increase resources, require huge steering logic, reduce clock frequency, etc.
[Figure: an alternative binding of the same eight operations to 2 ALUs (+/-) and 2 multipliers: Mult1, ALU1, ALU2, Mult2]
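The basic binding rule (operations in the same cycle must not share a resource) can be illustrated with a simple greedy assignment. This is a sketch, not the method any particular tool uses: `bind_ops`, the two resource kinds and the size limits are hypothetical, and it ignores the steering-logic cost mentioned above.

```c
#include <string.h>

#define MAX_CYCLES 16
#define MAX_UNITS  8

typedef enum { R_ALU, R_MULT, R_KINDS } res_kind_t;

/* Greedy binding: each operation gets the lowest-numbered unit of its
 * kind that is still free in its scheduled cycle.
 *   kind[v]  - resource kind needed by operation v
 *   cycle[v] - 1-based cycle chosen by the scheduler
 *   unit[v]  - (out) index of the unit bound to operation v
 *   used[k]  - (out) number of units of kind k that were needed
 * Returns 0 on success, -1 on invalid input or too many parallel ops. */
int bind_ops(int n, const res_kind_t kind[], const int cycle[],
             int unit[], int used[R_KINDS])
{
    unsigned char busy[R_KINDS][MAX_CYCLES + 1][MAX_UNITS];
    memset(busy, 0, sizeof busy);
    for (int k = 0; k < R_KINDS; k++) used[k] = 0;

    for (int v = 0; v < n; v++) {
        int k = kind[v], c = cycle[v], u = 0;
        if (k < 0 || k >= R_KINDS || c < 1 || c > MAX_CYCLES) return -1;
        while (u < MAX_UNITS && busy[k][c][u]) u++;   /* first free unit */
        if (u == MAX_UNITS) return -1;
        busy[k][c][u] = 1;
        unit[v] = u;
        if (u + 1 > used[k]) used[k] = u + 1;
    }
    return 0;
}
```

For a schedule loosely modelled on the figure (cycle 1: *, +, +; cycle 2: *, *, +; cycle 3: -; cycle 4: -; this cycle assignment is an assumption), the sketch ends up using 2 ALUs and 2 multipliers.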
ENG6530 Reconfigurable
Computing Systems
Xilinx Vivado
High Level Synthesis (HLS)
Or AutoESL
High-Level Synthesis: HLS
• High-Level Synthesis
  – Creates an RTL implementation from C-level source code
  – Extracts control and dataflow from the source code
  – Implements the design based on defaults and user-applied directives
• Many implementations are possible from the same source description
  – Smaller designs, faster designs, optimal designs
  – Enables manual design exploration
AutoESL or Vivado HLS
[Figure: AutoESL flow. Inputs: C, C++ or SystemC source, test bench, constraints/directives (script with constraints). Outputs: VHDL, Verilog or SystemC RTL with an RTL wrapper, passed on to RTL simulation and RTL synthesis]
Vivado HLS GUI Toolbar
• The primary commands have toolbar buttons
  – Easy access for standard tasks
  – Button highlights when the option is available
    • E.g. cannot perform C/RTL simulation before synthesis
[Figure: toolbar buttons: Create a new Project, Change Project Settings, Create a new Solution, Change Solution Settings, Run C Simulation, Run C Synthesis, Run C/RTL Cosimulation, Export RTL, Open Reports, Compare Reports, Open Analysis Viewer]
Design Exploration with Directives
One body of code: many hardware outcomes

    …
    loop: for (i=3;i>=0;i--) {
      if (i==0) {
        acc+=x*c[0];
        shift_reg[0]=x;
      } else {
        shift_reg[i]=shift_reg[i-1];
        acc+=shift_reg[i]*c[i];
      }
    }
    …

• The same hardware is used for each iteration of the loop: small area, long latency, low throughput
• Different hardware is used for each iteration of the loop: higher area, short latency, better throughput
• Different iterations are executed concurrently: higher area, short latency, best throughput

Before we get into details, let’s look under the hood …
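As a hedged illustration of how one directive changes the outcome, the FIR code can be written out in full with an unroll pragma inside the loop. The `int` typedefs are assumptions made so the file compiles on its own; `#pragma HLS UNROLL` asks Vivado HLS to replicate the loop body (different hardware per iteration), and an ordinary C compiler ignores the pragma (possibly with a warning), so the function still simulates in software.

```c
/* Assumed types; a real design would pick bit-widths to suit the data. */
typedef int data_t;
typedef int coef_t;
typedef int acc_t;

void fir(data_t *y, coef_t c[4], data_t x)
{
    static data_t shift_reg[4];   /* delay line, kept between calls */
    acc_t acc = 0;
    int i;

    for (i = 3; i >= 0; i--) {
#pragma HLS UNROLL
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}
```

Removing the pragma gives the default rolled loop (same hardware reused each iteration); replacing it with `#pragma HLS PIPELINE` is the usual way to request overlapped iterations. Feeding an impulse (1, 0, 0, 0) returns the coefficients in order, a quick software check before synthesis.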
Analysis Perspective
• Perspective for design analysis
  – Allows interactive analysis
Introduction to High-Level Synthesis
• How is hardware extracted from C code?
  – Control and datapath can be extracted from C code at the top level
  – The same principles used in the example can be applied to sub-functions
    • At some point in the top-level control flow, control is passed to a sub-function
    • A sub-function may be implemented to execute concurrently with the top level and/or other sub-functions
• How is this control and dataflow turned into a hardware design?
  – AutoESL maps this to hardware through scheduling and binding processes
• How is my design created?
  – How are functions, loops, arrays and IO ports mapped?
HLS: Control Extraction
Code: from any C code example …

    void fir (
      data_t *y,
      coef_t c[4],
      data_t x
      ) {

      static data_t shift_reg[4];
      acc_t acc;
      int i;

      acc=0;
      loop: for (i=3;i>=0;i--) {
        if (i==0) {
          acc+=x*c[0];
          shift_reg[0]=x;
        } else {
          shift_reg[i]=shift_reg[i-1];
          acc+=shift_reg[i]*c[i];
        }
      }
      *y=acc;
    }

Control behavior: the loops in the C code correlate to states of behavior.
[Figure: finite state machine (FSM) states 0, 1 and 2, marked against function start, for-loop start, for-loop end and function end]
This behavior is extracted into a hardware state machine.
HLS: Control & Datapath Extraction
• From any C code example, the control is known: finite state machine (FSM) states 0, 1, 2
• Operations are extracted: RDx, RDc, >=, -, ==, +, *, +, *, WRy
• A unified control dataflow behavior is created
[Figure: control dataflow graph combining the FSM states with the extracted operations]
High-Level Synthesis: Scheduling & Binding
• Scheduling & Binding
  – Scheduling and binding are at the heart of HLS
• Scheduling determines in which clock cycle an operation will occur
  – Takes into account the control, dataflow and user directives
  – The allocation of resources can be constrained
• Binding determines which library cell is used for each operation
  – Takes into account component delays, user directives
[Figure: design source (C, C++, SystemC), technology library and user directives feed scheduling and binding, which produce RTL (Verilog, VHDL, SystemC)]
Scheduling
• The operations in the control flow graph are mapped into clock cycles
• The technology and user constraints impact the schedule
  – A faster technology (or slower clock) may allow more operations to occur in the same clock cycle
• The code also impacts the schedule
  – Code implications and data dependencies must be obeyed

    void foo ( …
      t1 = a * b;
      t2 = c + t1;
      t3 = d * t2;
      out = t3 - e;
    }

[Figure: dataflow graph of foo (inputs a, b, c, d, e; output out) and two alternative schedules, Schedule 1 and Schedule 2, placing the operations *, +, *, - into clock cycles]
Binding
• Binding is where operations are mapped to cores from the hardware library
  – Operators map to cores
• Binding Decision #1: given a schedule with both multiplications in the same cycle
  – Binding must use 2 multipliers, since both are in the same cycle
  – It can decide to use an adder and a subtractor, or one addsub
• Binding Decision #2: given a schedule with each multiplication in a different cycle
  – Binding may decide to share the multipliers (each is used in a different cycle)
• What affects the decision made by the scheduler/binder?
  – Timing and availability of resources
  – Binding may decide the cost of sharing (muxing) would impact timing, and it may decide not to share them
  – Binding may make this same decision in the first example above too
Understanding AutoESL Synthesis
• HLS
  – AutoESL determines in which cycle operations should occur (scheduling)
  – Determines which hardware units to use for each operation (binding)
  – It performs HLS by:
    I. Obeying built-in defaults
    II. Obeying user directives & constraints to override defaults
    III. Calculating delays and area using the specified technology/device
• Understand AutoESL defaults
  – Key to understanding the initial design created by AutoESL
• Understand the priority of directives
  1. Meet performance (clock & throughput)
     • AutoESL will allow a local clock path to fail if this is required to meet throughput
     • Often the timing can still be met after logic synthesis
  2. Then minimize latency
  3. Then minimize area
C, C++ and SystemC Support
• The vast majority of C, C++ and SystemC is supported
  – Provided it is statically defined at compile time
  – If it’s not defined until run time, it won’t be synthesizable
• Any of the three variants of C can be used
  – If C is used, Vivado HLS expects the file extension to be .c
  – For C++ and SystemC it expects the file extension .cpp
Unsupported Constructs: Overview
• System calls and function pointers
  – Dynamic memory allocation
    • malloc() & free()
  – Standard I/O and file I/O operations
    • fprintf() / fscanf() etc.
  – System calls
    • time(), sleep() etc.
• Data types
  – Forward declared type
  – Recursive type definitions
    • Type contains members with the same type
• Non-standard pointers
  – Pointer casting between general data types
    • OK with native integer types
  – If a double pointer is used in multiple functions, Vivado HLS will inline all the functions
    • Slower synthesis, may increase area & run time
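Since everything must be statically defined at compile time, a common workaround for `malloc()` is a fixed-size buffer with a compile-time bound. The function below is a hypothetical example of that rewrite; `MAX_N` and `sum_squares` are made-up names, and the bound would be chosen per design.

```c
#define MAX_N 64   /* hypothetical compile-time bound on the input size */

/* Sum of squares of the first n inputs.
 * A software version might allocate the scratch buffer with
 * malloc(n * sizeof(int)); here it is a fixed-size array so that its
 * size is known at compile time. Returns -1 if n is out of range. */
long sum_squares(const int in[], int n)
{
    int buf[MAX_N];
    long sum = 0;

    if (n < 0 || n > MAX_N) return -1;

    /* Fixed loop bounds; the condition masks the unused iterations. */
    for (int i = 0; i < MAX_N; i++)
        buf[i] = (i < n) ? in[i] * in[i] : 0;

    for (int i = 0; i < MAX_N; i++)
        sum += buf[i];

    return sum;
}
```

The price is that storage for the worst case is always allocated, which is the usual trade-off when moving from run-time to compile-time sizing.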
The Key Attributes of C Code

    void fir (
      data_t *y,
      coef_t c[4],
      data_t x
      ) {

      static data_t shift_reg[4];
      acc_t acc;
      int i;

      acc=0;
      loop: for (i=3;i>=0;i--) {
        if (i==0) {
          acc+=x*c[0];
          shift_reg[0]=x;
        } else {
          shift_reg[i]=shift_reg[i-1];
          acc+=shift_reg[i] * c[i];
        }
      }
      *y=acc;
    }

• Functions: All code is made up of functions which represent the design hierarchy: the same in hardware
• Top Level IO: The arguments of the top-level function determine the hardware RTL interface ports
• Types: All variables are of a defined type. The type can influence the area and performance
• Loops: Functions typically contain loops. How these are handled can have a major impact on area and performance
• Arrays: Arrays are used often in C code. They can influence the device IO and become performance bottlenecks
• Operators: Operators in the C code may require sharing to control area or specific hardware implementations to meet performance

Let’s examine the default synthesis behavior of these …
Types = Operator Bit-sizes
• From any C code example, operations are extracted: RDx, RDc, >=, -, ==, +, *, +, *, WRy
• Types: the C types define the size of the hardware used (handled automatically)
• Standard C types
  – long long (64-bit), int (32-bit), short (16-bit), char (8-bit)
  – double (64-bit), float (32-bit)
  – unsigned types
  – For floats and doubles there must be an FP core in the library that binding can map to, else the design cannot be synthesized
• Arbitrary precision types
  – C: ap(u)int types (1-1024)
  – C++: ap_(u)int types (1-1024), ap_fixed types
  – C++/SystemC: sc_(u)int types (1-1024), sc_fixed types
  – Can be used to define any variable to be a specific bit-width (e.g. 17-bit, 47-bit, etc.)
Why is Arbitrary Precision Needed?
• Code using native C int type
• However, if the inputs will only have a max range of 8-bit
  – Arbitrary precision data-types should be used
  – It will result in smaller & faster hardware with the full required precision
  – With arbitrary precision types on function interfaces, Vivado HLS can propagate the correct bit-widths throughout the design
HLS & C Types
• There are 4 basic types you can use for HLS
  – Standard C/C++ types
  – Vivado HLS enhancements to C: apint
  – Vivado HLS enhancements to C++: ap_int, ap_fixed
  – SystemC types
Arbitrary Precision: C apint types
• For C
  – Vivado HLS types apint can be used
  – Range: 1 to 1024 bits
  – Specify the integers as shown and just use them like any other variable
• There are two issues to be aware of
  – C compilation: you MUST use apcc to simulate (no debugger support)
  – Be aware of integer promotion issues

    #include "ap_cint.h"   // include header file

    void foo_top (…) {
      int9 var1;     // 9-bit
      uint10 var2;   // 10-bit unsigned

Failure to use apcc to compile the C will result in INCORRECT results. This only applies to C, NOT C++ or SystemC.
Using apcc
• apcc
  – Command line compatible with gcc
  – Required to support arbitrary precision for C
  – Use apcc at the Vivado HLS CLI (shell)
• apcc understands bit-accurate types
  – Once you create bit-accurate types you must re-validate the C
  – It’s the only way to discover rounding and truncation issues
    • It’s fast in C!

    #include "ap_cint.h"
    int3 ex_bit_accurate (
      int3 x1,
      int3 y1
      ) {
      return x1+y1;
    }

    shell> apcc -o my_test test.c test_tb.c

Given x1=2, y1=2:
• apcc simulation (simulates as hardware): 010 + 010 = 100 in 3 bits, so the result is -4
• gcc simulation: the operands are held in native integers, so 2 + 2 = 4
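The 3-bit wrap-around in this example can be reproduced in plain C without the Vivado headers. The helpers below are hypothetical and only emulate two’s-complement wrap-around for a given width; they are not part of `ap_cint.h`.

```c
/* Wrap v into a signed two's-complement field of `bits` bits (1..31). */
int wrap_signed(int v, int bits)
{
    unsigned mask = (1u << bits) - 1u;
    unsigned sign = 1u << (bits - 1);
    unsigned u = (unsigned)v & mask;      /* keep the low `bits` bits */
    return (int)(u ^ sign) - (int)sign;   /* sign-extend the field */
}

/* Emulates the int3 addition from the slide. */
int add_int3(int x1, int y1)
{
    return wrap_signed(x1 + y1, 3);
}
```

`add_int3(2, 2)` yields -4, the value the hardware would produce, whereas the same addition on native `int` yields 4.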
• Support for fixed point datatypes in C++– Include the path to the ap_fixed.h header file
– Both signed (ap_fixed) and unsigned types (ap_ufixed)
• Advantages of Fixed Point types– The result of variables with different sizes is automatically taken care is automatically taken care of
– The binary point is automatically aligned• Quantization: UnderflowUnderflow is automatically handled• Overflow: SaturationSaturation is automatically handled
Arbitrary Precision : C++ ap_fixed types
#include "ap_fixed.h"
void foo_top (…) {
  ap_fixed<9, 5, AP_RND_CONV, AP_SAT> var1;    // 9-bit, 5 integer bits, 4 fractional bits
  ap_ufixed<10, 7, AP_RND_CONV, AP_SAT> var2;  // 10-bit unsigned, 7 integer bits, 3 fractional bits
  …
}
$VIVADO_HLS_HOME/include/ap_fixed.h
Alternatively, make the result variable large enough such that overflow or underflow does not occur
• Fixed-point types are specified by
– Total bit width (W)
– The number of integer bits (I)
– The quantization/rounding mode (Q)
– The overflow/saturation mode (O)
– The number of saturation bits
Definition of ap_fixed type
Identifier  Description
W           Word length in bits
I           The number of bits used to represent the integer value (the number of bits above the binary point)
Q           Quantization mode: dictates the behavior when greater precision is generated than can be defined by the LSBs
              AP_RND          Rounding to plus infinity
              AP_RND_ZERO     Rounding to zero
              AP_RND_MIN_INF  Rounding to minus infinity
              AP_RND_INF      Rounding to infinity
              AP_RND_CONV     Convergent rounding
              AP_TRN          Truncation to minus infinity (default)
              AP_TRN_ZERO     Truncation to zero
O           Overflow mode: dictates the behavior when more bits are required than the word contains
              AP_SAT          Saturation
              AP_SAT_ZERO     Saturation to zero
              AP_SAT_SYM      Symmetrical saturation
              AP_WRAP         Wrap around (default)
              AP_WRAP_SM      Sign magnitude wrap around
N           The number of saturation bits in wrap modes
Binary point: W = I + B
ap_[u]fixed<W, I, Q, O, N>
[Figure: bit weights run from I-1 down to 0 above the binary point and from -1 down to -B below it]
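To make the W, I, Q and O parameters concrete, here is a rough plain-C model of one combination only: truncation toward minus infinity with saturation on overflow, for a signed type. `quantize_trn_sat` is a hypothetical helper, not the ap_fixed implementation, and it does not model the convergent rounding used in the slide's declarations.

```c
/* Quantize `x` onto a signed fixed-point grid with `w` total bits and `i`
 * integer bits (so w - i fractional bits). Values are truncated toward
 * minus infinity and saturated at the representable limits.
 * Hypothetical helper for illustration only. */
double quantize_trn_sat(double x, int w, int i)
{
    double scale    = (double)(1L << (w - i));   /* 2^(fractional bits) */
    long   max_code = (1L << (w - 1)) - 1;       /* largest signed code */
    long   min_code = -(1L << (w - 1));          /* smallest signed code */
    double scaled   = x * scale;

    /* Saturate when the value falls outside the representable range. */
    if (scaled >= (double)max_code) return (double)max_code / scale;
    if (scaled <= (double)min_code) return (double)min_code / scale;

    long code = (long)scaled;                    /* cast truncates toward zero */
    if ((double)code > scaled) code -= 1;        /* step down to floor for negatives */
    return (double)code / scale;
}
```

For W = 9 and I = 5 the grid step is 1/16 and the range is -16 to 15.9375, so 3.14159 becomes 3.125 and 100 saturates to 15.9375.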
• Synthesis for floating point
– Data types (IEEE-754 standard compliant)
• Single-precision 32 bit: 24-bit fraction, 8-bit exponent
• Double-precision 64 bit: 53-bit fraction, 11-bit exponent
• Support for operators
– Vivado HLS supports the Floating Point (FP) cores for each Xilinx technology
• If Xilinx has an FP core, Vivado HLS supports it
• It will automatically be synthesized
– If there is no such FP core in the Xilinx technology, it will not be in the library
• The design will still be synthesized
Floating Point Support
Floating Point Cores
• Vivado HLS provides support for many math functions
– Even if no floating-point core exists
– These functions are implemented in a bit-approximate manner
– The results may differ within a few Units of Least Precision (ULP) from the C/C++ standards
• If you use math.h (C) or cmath (C++)
– The functions will be synthesized automatically
– The C simulation results may differ from the RTL simulation results
– Use a test bench which checks for ranges: not == or !=
• If you replace math.h or cmath with the Vivado HLS header file “hls_math.h”, or keep math/cmath and “add_files hls_lib.c”
– The C simulation will match the RTL simulation
– The C simulation may differ from the C simulation using math/cmath (or math/cmath without hls_lib.c)
Support for Math Functions
More details are available in the Coding Style Guide chapter in the User Guide
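A range-based check of the kind recommended above might look like the following sketch. The helper names and the choice of an absolute tolerance are assumptions made for illustration; a real test bench would pick a tolerance suited to the function under test.

```c
/* Return 1 when `got` lies within `tol` of `expected`, otherwise 0.
 * Hypothetical helper: compares by range rather than with == or !=. */
int within_tolerance(double got, double expected, double tol)
{
    double diff = got - expected;
    if (diff < 0.0) diff = -diff;   /* absolute difference without libm */
    return diff <= tol;
}

/* Count how many of the n results fall outside the tolerance. */
int count_mismatches(const double *got, const double *expected, int n, double tol)
{
    int errors = 0;
    for (int k = 0; k < n; k++) {
        if (!within_tolerance(got[k], expected[k], tol)) {
            errors++;
        }
    }
    return errors;
}
```

A test bench can then return `count_mismatches(...)` from main(), so that a result differing by a few ULP still passes while a real error fails.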
Loops
• By default, loops are rolled
– Each C loop iteration is implemented in the same state
– Each C loop iteration is implemented with the same resources
– IMPORTANT: Loops can be unrolled if their indices are statically determinable at elaboration time
• Not when the number of iterations is variable
– Unrolled loops result in more elements to schedule but greater operator mobility
• Let’s look at an example…
void foo_top (…) {
  ...
  Add: for (i=3;i>=0;i--) {
    b = a[i] + b;
  }
  ...
}
[Figure: synthesis of foo_top produces one adder that is reused for N iterations, reading a[N] and accumulating into b]
Loops require labels if they are to be referenced by Tcl directives
(GUI will auto-add labels)
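As a rough illustration of what unrolling means for the Add loop, the sketch below writes the same computation in rolled and fully unrolled form. The function names are hypothetical, and the unrolled version is written by hand here, whereas the tool would produce the equivalent structure from a directive.

```c
/* Rolled form: a single addition repeated over four iterations. */
int sum_rolled(const int a[4], int b)
{
    for (int i = 3; i >= 0; i--) {
        b = a[i] + b;
    }
    return b;
}

/* Fully unrolled form: the four additions written out explicitly.
 * There is no loop counter or exit test left, and the additions are
 * visible to the scheduler as separate operations. */
int sum_unrolled(const int a[4], int b)
{
    b = a[3] + b;
    b = a[2] + b;
    b = a[1] + b;
    b = a[0] + b;
    return b;
}
```

Both functions return the same value; the difference is only in how many operations the scheduler can see and place at once.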
Data Dependencies: Good
• Example of good mobility
– The read on data port X can occur anywhere from the start to iteration 4
• The only constraint on RDx is that it occur before the final multiplication
– AutoESL has a lot of freedom with this operation
• It waits until the read is required, saving a register
• There are no advantages to reading any earlier (unless you want it registered)
• Input reads can be optionally registered
– The final multiplication is very constrained…
void fir ( …
  acc=0;
  loop: for (i=3;i>=0;i--) {
    if (i==0) {
      acc+=x*c[0];
      shift_reg[0]=x;
    } else {
      shift_reg[i]=shift_reg[i-1];
      acc+=shift_reg[i]*c[i];
    }
  }
  *y=acc;
}
[Figure: default schedule. Each of iterations 1 to 4 performs a coefficient read (RDc), the index compare (==), decrement (-) and loop test (>=), then a multiply (*) and add (+); RDx may be placed anywhere before the final multiply and WRy follows iteration 4. The read X operation has good mobility.]
Data Dependencies: Bad
• Example of bad mobility
– The final multiplication must occur after the read and before the final addition
• It could occur in the same cycle if timing allows
– Loops are rolled by default
• Each iteration cannot start till the previous iteration completes
• The final multiplication (in iteration 4) must wait for earlier iterations to complete
– The structure of the code is forcing a particular schedule
• There is little mobility for most operations
– Optimizations allow loops to be unrolled, giving greater freedom
[Figure: default schedule for the same fir loop, iterations 1 to 4. The multiplication in iteration 4 is very constrained.]
Schedule after Loop Optimization
• With the loop unrolled (completely)
– The dependency on loop iterations is gone
– Operations can now occur in parallel
• If data dependencies allow
• If operator timing allows
– Design finishes faster but uses more operators
• 2 multipliers & 2 adders
• Schedule Summary
– All the logic associated with the loop counters and index checking is now gone
– Two multiplications can occur at the same time
• All 4 could, but it’s limited by the number of input reads (2) on coefficient port C
– Why 2 reads on port C?
• The default behavior for arrays now limits the schedule…
[Figure: schedule of the fully unrolled fir loop. Two RDc reads and two multiplications occur at a time, followed by the additions and WRy.]
Arrays in HLS
• An array in C code is implemented by a memory in the RTL
– By default, arrays are implemented as RAMs, optionally a FIFO
• The array can be targeted to any memory resource in the library
– The ports (Address, CE active high, etc.) and sequential operation (clocks from address to data out) are defined by the library model
– All RAMs are listed in the AutoESL Library Guide
• Arrays can be merged with other arrays and reconfigured
– To implement them in the same memory or one of different widths & sizes
• Arrays can be partitioned into individual elements
– Implemented as smaller RAMs or registers
void foo_top(int x, …) {
  int A[N];
  L1: for (i = 0; i < N; i++)
    A[i+x] = A[i] + i;
}
[Figure: synthesis maps array A[N] (elements N-1 down to 0) inside foo_top to a single-port RAM (SPRAMB) with DOUT, DIN, ADDR, CE and WE ports]
Top-Level IO Ports
• Top-level function arguments
– All top-level function arguments have a default hardware port type
• When the array is an argument of the top-level function
– The array/RAM is “off-chip”
– The type of memory resource determines the top-level IO ports
– Arrays on the interface can be mapped & partitioned
• E.g. partitioned into separate ports for each element in the array
• Default RAM resource
– Dual port RAM if performance can be improved, otherwise Single Port RAM
void foo_top(int A[3*N], int x) {
  L1: for (i = 0; i < N; i++)
    A[i+x] = A[i] + i;
}
[Figure: synthesis connects foo_top to an external dual-port RAM (DPRAMB) through two port sets: DOUT0, DIN0, ADDR0, CE0, WE0 and DOUT1, DIN1, ADDR1, CE1, WE1. The number of ports is defined by the RAM resource.]
Schedule after an Array Optimization
• With the existing code & defaults
– Port C is a dual port RAM
– Allows 2 reads per clock cycle
• IO behavior impacts performance
• With the C port partitioned into (4) separate ports
– All reads and mults can occur in one cycle
– If the timing allows
• The additions can also occur in the same cycle
• The write can be performed in the same cycle
• Optionally the port reads and writes could be registered
[Figure: two schedules for the unrolled fir loop. With the dual port RAM, the four RDc reads take two cycles, two at a time; with port C partitioned, all four RDc reads and all four multiplications occur together, followed by the additions and WRy.]
Note: It could have performed 2 reads in the original rolled design but there was no advantage since the rolled loop forced a single read per cycle
Schedule after an Array Optimization
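The unrolling and partitioning discussed above could be expressed in the source roughly as follows. This is a sketch under assumptions: the slides show the optimizations only as schedules, so the pragma spellings (`#pragma HLS UNROLL`, `#pragma HLS ARRAY_PARTITION`) are the Vivado HLS forms as I understand them and should be checked against the tool's user guide. The `fir_step` wrapper is hypothetical and exists only so the function can be exercised from ordinary C, where the pragmas are ignored.

```c
#define TAPS 4

/* 4-tap FIR in the style of the slide's fir() example.
 * The pragmas are hints for the HLS tool (assumed spelling);
 * an ordinary C compiler ignores them. */
void fir(int *y, const int c[TAPS], int x)
{
    static int shift_reg[TAPS];   /* delay line, kept between calls */
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
    int acc = 0;
loop:
    for (int i = TAPS - 1; i >= 0; i--) {
#pragma HLS UNROLL
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

/* Hypothetical convenience wrapper: feed one sample, return one output. */
int fir_step(const int c[TAPS], int x)
{
    int y;
    fir(&y, c, x);
    return y;
}
```

Feeding an impulse (1 followed by zeros) through `fir_step` returns the coefficients in order, which is a quick sanity check that the directives have not changed the function's behavior in C simulation.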
Operators
• Operator sizes are defined by the type (*, +, -, /)
– The variable type defines the size of the operator
• AutoESL will try to minimize the number of operators
– By default AutoESL will seek to minimize area after constraints are satisfied
• User can set specific limits & targets for the resources used
– Allocation can be controlled
• An upper limit can be set on the number of operators or cores allocated for the design: this can be used to force sharing
• e.g. limiting the number of multipliers to 1 will force AutoESL to share
– Resources can be specified
• The cores used to implement each operator can be specified
• e.g. implement each multiplier using a 2-stage pipelined core (hardware)
[Figure: four multiplications (3, 2, 1, 0) scheduled on one shared multiplier, and the same four scheduled on two pipelined multipliers]
Use 1 mult, but take 4 cycles even if it could be done in 1 cycle using 4 mults
The same 4 mult operations could be done with 2 pipelined mults (with allocation limiting the mults to 2)
AutoESL Terminology (Clock Cycles)
[Figure: timing diagram against clk. Input X0 is read, the loop runs iterations 1/4 to 4/4 on internal data, and output Y0 is written before X1 is read. Latency = 14, Throughput = 14, Loop Latency = 12, Tripcount = 4.]
Term                      Description                                                                                          Value in this example
Latency                   The number of cycles from input to output (final output of an array write)                          14 cycles
Throughput                The number of cycles between new input samples (in this example it must wait for all operations to complete before it can read a new input)   14 cycles
Data Rate                 (1/throughput) * clock frequency                                                                     10 ns clock => 7.14 MHz, 1/(14 * 10e-9 s)
Initiation Interval (II)  The number of cycles between new inputs to a pipeline (the same as throughput, but this term is used with pipelines)   Not shown in this example
Trip count                The number of iterations in a loop                                                                   4
Loop Latency              The latency of the entire loop (divide by trip count to get the latency for each loop iteration)    12 cycles
You may have your own terminology: this is AutoESL’s
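The data rate figure follows directly from the clock period and the throughput. A small sketch of the arithmetic, with hypothetical function names:

```c
/* Input samples per second for a design that accepts a new input every
 * `throughput_cycles` clock cycles at the given clock period. */
double data_rate_hz(double clock_period_ns, unsigned throughput_cycles)
{
    double clock_hz = 1.0e9 / clock_period_ns;   /* 10 ns -> 100 MHz */
    return clock_hz / (double)throughput_cycles;
}

/* Latency of a single loop iteration: loop latency divided by trip count. */
unsigned iteration_latency(unsigned loop_latency, unsigned trip_count)
{
    return loop_latency / trip_count;
}
```

For the example above, `data_rate_hz(10.0, 14)` is about 7.14 MHz and `iteration_latency(12, 4)` is 3 cycles.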
• Vivado HLS has a number of ways to improve performance
– Automatic (and default) optimizations
– Latency directives
– Pipelining to allow concurrent operations
• Vivado HLS supports techniques to remove performance bottlenecks
– Manipulating loops
– Partitioning and reshaping arrays
• Optimizations are performed using directives
– Let’s look first at how to apply and use directives in Vivado HLS
Improving Performance
• Directives can be placed in the directives file
– The Tcl command is written into directives.tcl
– There is a directives.tcl file in each solution
• Each solution can have different directives
• Directives can be placed into the C source
– Pragmas are added (and will remain) in the C source file
– Pragmas (#pragma) will be used by every solution which uses the code
Optimization Directives: Tcl or Pragma
Once applied the directive will be shown in the Directives tab (right-click to modify or delete)
• Select the New Solution Button
• Optionally modify any of the settings
– Part, Clock Period, Uncertainty
– Solution Name
• Copy existing directives
– By default selected
– Uncheck if you do not want to copy
– No need to copy pragmas, they are in the code
• Copy any existing custom commands into the new script.tcl
– By default selected
– Uncheck if you do not want to copy
Copying Directives into New Solutions
Functions & RTL Hierarchy
• Each function is translated into an RTL block
– Verilog module, VHDL entity
– By default, each function is implemented using a common instance
– Functions may be inlined to dissolve their hierarchy
• Small functions may be automatically inlined
void A() { ..body A.. }
void B() { ..body B.. }
void C() {
  B();
}
void D() {
  B();
}
void foo_top() {
  A(…);
  C(…);
  D(…);
}
[Figure: source code (my_code.c) beside the resulting RTL hierarchy: foo_top contains A, C (which contains B) and D (which contains B)]
Each function/block can be shared like any other component (add, sub, etc) provided it’s not in use at the same time
• Inlining can be used to remove function hierarchy
void sumsub_func (int *in1, int *in2, int *outSum, int *outSub) {
  *outSum = *in1 + *in2;
  *outSub = *in1 - *in2;
}
void shift_func (int *in1, int *in2, int *outA, int *outB) {
  *outA = *in1 >> 1;
  *outB = *in2 >> 2;
}
void add_sub_pass(int A, int B, int *C, int *D) {
  int apb, amb;
  int a2, b2;
  sumsub_func(&A,&B,&apb,&amb);
  sumsub_func(&apb,&amb,&a2,&b2);
  shift_func(&a2,&b2,C,D);
}
[Figure: No inlining: add_sub_pass instantiates sumsub_func twice and shift_func once (2 adders, 2 subtractors). Inlining: (A+B)+(A-B) = 2A and (A+B)-(A-B) = 2B, so after the >>1 and >>2 shifts the outputs reduce to A and B>>1 (zero area).]
Inlining allows optimization to be performed across function hierarchies
Like RTL ungrouping, too much inlining can create a lot of logic and slow runtime
Function Inlining
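The "zero area" result can be checked in ordinary C. The sketch below keeps the hierarchical version from the slide (with void return types) next to a hand-simplified version that represents what optimization across the inlined hierarchy can reach. `add_sub_pass_inlined` and `outputs_match` are hypothetical names, and the comparison assumes non-negative inputs small enough that the intermediate sums do not overflow.

```c
void sumsub_func(const int *in1, const int *in2, int *outSum, int *outSub)
{
    *outSum = *in1 + *in2;
    *outSub = *in1 - *in2;
}

void shift_func(const int *in1, const int *in2, int *outA, int *outB)
{
    *outA = *in1 >> 1;
    *outB = *in2 >> 2;
}

/* Hierarchical version: two sum/sub blocks followed by the shifts. */
void add_sub_pass(int A, int B, int *C, int *D)
{
    int apb, amb;
    int a2, b2;
    sumsub_func(&A, &B, &apb, &amb);      /* apb = A+B, amb = A-B */
    sumsub_func(&apb, &amb, &a2, &b2);    /* a2 = 2A,   b2 = 2B   */
    shift_func(&a2, &b2, C, D);           /* C = 2A>>1, D = 2B>>2 */
}

/* Simplified version: the adders and subtractors cancel out. */
void add_sub_pass_inlined(int A, int B, int *C, int *D)
{
    *C = A;
    *D = B >> 1;
}

/* Return 1 when both versions produce the same outputs. */
int outputs_match(int A, int B)
{
    int c1, d1, c2, d2;
    add_sub_pass(A, B, &c1, &d1);
    add_sub_pass_inlined(A, B, &c2, &d2);
    return c1 == c2 && d1 == d2;
}
```

The simplification is only visible once the function boundaries are removed, which is why inlining can reduce area in this example.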
• Vivado HLS performs some inlining automatically
– This is performed on small logic functions if Vivado HLS determines area or performance will benefit
• User Control
– Functions can be specifically inlined
• The function itself is inlined
– Optionally recursively down the hierarchy
– Optionally everything within a region can be inlined
• Everything in a named region, a function or a loop
– Optionally inlining can be explicitly prevented
• Turn inlining off
• Inlining functions allows for greater optimization
– Like ungrouping RTL hierarchies: optimization across boundaries
– Like ungrouping RTL hierarchies it can result in lots of operations & impact run time
Controlling Inlining
• Design Latency
– The latency of the design is the number of cycles it takes to output the result
• In this example the latency is 10 cycles
• Design Throughput
– The throughput of the design is the number of cycles between new inputs
• By default (no concurrency) this is the same as latency
• Next start/read is when this transaction ends
• In the absence of any concurrency
– Latency is the same as throughput
Latency and Throughput – The Performance Factors
• Given a design with multiple functions
– The code and dataflow are as shown
• Vivado HLS will schedule the design
• It can also automatically optimize the dataflow for throughput
Improving Throughput
• Dataflow Optimization
– Can be used at the top-level function
– Allows blocks of code to operate concurrently
• The blocks can be functions or loops
• Dataflow allows loops to operate concurrently
– It places channels between the blocks to maintain the data rate
• For arrays the channels will include memory elements to buffer the samples
• For scalars the channel is a register with hand-shakes
• Dataflow optimization therefore has an area overhead
– Additional memory blocks are added to the design
Dataflow Optimization
• Dataflow is set using a directive
– Vivado HLS will seek to create the highest performance design
• Throughput of 1
Dataflow Optimization Commands
• Dataflow Optimization
– Dataflow optimization is “coarse grain” pipelining at the function and loop level
– Increases concurrency between functions and loops
– Only works on functions or loops at the top level of the hierarchy
• Cannot be used in sub-functions
• Function & Loop Pipelining
– “Fine grain” pipelining at the level of the operators (*, +, >>, etc.)
– Allows the operations inside the function or loop to operate in parallel
– Unrolls all sub-loops inside the function or loop being pipelined
• Loops with variable bounds cannot be unrolled: this can prevent pipelining
• Unrolling loops increases the number of operations and can increase memory and run time
Pipelining: Dataflow, Functions & Loops
Without Pipelining
• There are 3 clock cycles before operation RD can occur again
– Throughput = 3 cycles
• There are 3 cycles before the 1st output is written
– Latency = 3 cycles
With Pipelining
• The latency is the same
– Latency = 3 cycles
• The throughput is better
– Fewer cycles, higher throughput
– Throughput = 1 cycle
void foo(...) {
  op_Read;
  op_Compute;
  op_Write;
}
[Figure: without pipelining, the RD, CMP and WR of the second transaction start only after the first WR; with pipelining, a new RD starts every cycle while earlier transactions continue through CMP and WR]
Function Pipelining
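The effect of pipelining on total run time can be estimated with a simple cycle count. This is a sketch of the usual formula, assuming a fixed latency and a fixed initiation interval; `total_cycles` is a hypothetical helper name.

```c
/* Cycles needed to complete n transactions when each transaction takes
 * `latency` cycles and a new one can start every `ii` cycles.
 * Without pipelining, ii equals the latency. */
unsigned total_cycles(unsigned latency, unsigned ii, unsigned n)
{
    if (n == 0) {
        return 0;
    }
    /* The first result appears after `latency` cycles; each further
     * transaction adds one initiation interval. */
    return latency + (n - 1) * ii;
}
```

For four transactions of the RD, CMP, WR example, the unpipelined design (II = 3) needs 12 cycles and the pipelined design (II = 1) needs 6.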
• The pipeline directive pipelines functions or loops
– This example pipelines the function with an Initiation Interval (II) of 2
• The II is the same as the throughput but this term is used exclusively with pipelines
• Omit the target II and Vivado HLS will automatically pipeline for the fastest possible design
– Specifying a more accurate maximum may allow more sharing (smaller area)
Pipelining Commands
[Figure: two overlapped RD, CMP, WR transactions; the offset between their start times is the Initiation Interval (II)]
• Vivado HLS will attempt to unroll all loops nested below a PIPELINE directive
– May not succeed for various reasons and/or may lead to unacceptable area
• Loops with variable bounds cannot be unrolled
• Unrolling multi-level loop nests may create a lot of hardware
– Pipelining the inner-most loop will result in best performance for area
• Or next one (or two) out if inner-most is modest and fixed
• e.g. Convolution algorithm
• Outer loops will keep the inner pipeline fed
Pipelining and Function/Loop Hierarchy
Pipelining the function:
void foo(in1[ ][ ], in2[ ][ ], …) {
#pragma AP PIPELINE
  …
  L1:for(i=1;i<N;i++) {
    L2:for(j=0;j<M;j++) {
      out[i][j] = in1[i][j] + in2[i][j];
    }
  }
}
Unrolls L1 and L2: N*M adders, 3(N*M) accesses

Pipelining loop L1:
void foo(in1[ ][ ], in2[ ][ ], …) {
  …
  L1:for(i=1;i<N;i++) {
#pragma AP PIPELINE
    L2:for(j=0;j<M;j++) {
      out[i][j] = in1[i][j] + in2[i][j];
    }
  }
}
Unrolls L2: M adders, 3M accesses

Pipelining loop L2:
void foo(in1[ ][ ], in2[ ][ ], …) {
  …
  L1:for(i=1;i<N;i++) {
    L2:for(j=0;j<M;j++) {
#pragma AP PIPELINE
      out[i][j] = in1[i][j] + in2[i][j];
    }
  }
}
1 adder, 3 accesses
Select loop “Add” in the directives pane and right-click
Unrolled loops allow greater option & exploration
Unrolled loops are likely to result in more hardware resources and higher area
Unrolling Loops
• Vivado HLS can automatically flatten nested loops
– A faster approach than manually changing the code
• Flattening should be specified on the inner-most loop
– It will be flattened into the loop above
– The “off” option can prevent loops in the hierarchy from being flattened
Loop Flattening
Before flattening:
void foo_top (…) {
  ...
  L1: for (i=3;i>=0;i--) {
    [loop body l1 ]
  }
  L2: for (i=3;i>=0;i--) {
    L3: for (j=3;j>=0;j--) {
      [loop body l3 ]
    }
  }
  L4: for (i=3;i>=0;i--) {
    [loop body l4 ]
  }
}
[Figure: states 1, 2, 3 and 4, each loop executed x4: 36 transitions]
After flattening:
void foo_top (…) {
  ...
  L1: for (i=3;i>=0;i--) {
    [loop body l1 ]
  }
  L2: for (k=15;k>=0;k--) {
    [loop body l3 ]
  }
  L4: for (i=3;i>=0;i--) {
    [loop body l4 ]
  }
}
[Figure: states 1, 2 and 4, executed x4, x16 and x4: 28 transitions]
Loops will be flattened by default: use “off” to disable
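What flattening does to the L2/L3 nest can be sketched by hand in C: the two counters are replaced by one counter that covers all 16 iterations, and the original indices are recovered from it. The function names and the summation body are assumptions chosen only to make the example testable.

```c
/* Nested form: outer loop L2 around inner loop L3, 4 x 4 iterations. */
int sum_nested(int m[4][4])
{
    int acc = 0;
    for (int i = 3; i >= 0; i--) {
        for (int j = 3; j >= 0; j--) {
            acc += m[i][j];
        }
    }
    return acc;
}

/* Flattened form: a single loop of 16 iterations. The row and column
 * are recovered from the single counter, so there is no transition
 * between an inner and an outer loop. */
int sum_flattened(int m[4][4])
{
    int acc = 0;
    for (int k = 15; k >= 0; k--) {
        acc += m[k / 4][k % 4];
    }
    return acc;
}
```

Both forms visit the same 16 elements in the same order, so the result is unchanged; the saving is in the loop entry and exit transitions.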
C Validation and RTL Verification
• There are two steps to verifying the design
– Pre-synthesis: C Validation
• Validate the algorithm is correct
– Post-synthesis: RTL Verification
• Verify the RTL is correct
• C validation
– A HUGE reason users want to use HLS
• Fast, free verification
• Validate the algorithm is correct before synthesis
• Follow the test bench tips given over
• RTL Verification
– AutoESL can co-simulate the RTL with the original test bench
[Figure: design flow. The test bench and the C, C++ or SystemC source, together with a script of constraints/directives, feed AutoESL, which generates VHDL, Verilog or SystemC for RTL synthesis. The C is validated before synthesis and the RTL is verified after it.]
C Function Test Bench
• The test bench is the level above the function
– The main() function is above the function to be synthesized
• Good Practices
– The test bench should compare the results with golden data
• Automatically confirms any changes to the C are validated
• Automatically verifies the RTL is correct
– The test bench should return a 0 if the self-checking is correct
• Anything but a 0 (zero) will cause RTL verification to issue a FAIL message
• Function main() should expect an integer return (non-void)
int main () {
  int ret=0;
  …
  ret = system("diff --brief -w output.dat output.golden.dat");
  if (ret != 0) {
    printf("Test failed !!!\n");
    ret=1;
  } else {
    printf("Test passed !\n");
  }
  …
  return ret;
}
• The test bench should be in a separate file
• Or excluded from synthesis
– The macro __SYNTHESIS__ can be used to isolate code which will not be synthesized
• This macro is defined when Vivado HLS parses any code (-D__SYNTHESIS__)
// test.c
#include <stdio.h>
void test (int d[10]) {
  int acc = 0;
  int i;
  for (i=0;i<10;i++) {
    acc += d[i];
    d[i] = acc;
  }
}
#ifndef __SYNTHESIS__
int main () {
  int d[10], i;
  for (i=0;i<10;i++) {
    d[i] = i;
  }
  test(d);
  for (i=0;i<10;i++) {
    printf("%d %d\n", i, d[i]);
  }
  return 0;
}
#endif
Test benches I
Design to be synthesized: function test()
Test bench: nothing in this ifndef will be read by Vivado HLS (will be read by gcc)
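The test bench above prints its results but does not check them. A self-checking variant in the spirit of the good practices listed earlier might look like this sketch. The design function is renamed `accumulate` here and the golden values are held in an array rather than a file; both are choices made for this illustration, not something the slides prescribe.

```c
/* Design under test: running sum, same body as test() on the slide. */
void accumulate(int d[10])
{
    int acc = 0;
    for (int i = 0; i < 10; i++) {
        acc += d[i];
        d[i] = acc;
    }
}

/* Self-checking run: returns 0 when every output matches the golden
 * data and 1 otherwise, following the return convention the slides
 * recommend for main(). */
int run_self_check(void)
{
    /* Golden data: prefix sums of 0, 1, ..., 9. */
    static const int golden[10] = {0, 1, 3, 6, 10, 15, 21, 28, 36, 45};
    int d[10];

    for (int i = 0; i < 10; i++) {
        d[i] = i;
    }
    accumulate(d);
    for (int i = 0; i < 10; i++) {
        if (d[i] != golden[i]) {
            return 1;   /* mismatch: report failure */
        }
    }
    return 0;           /* all outputs match */
}
```

A main() that simply returns `run_self_check()` gives the non-zero-on-failure behavior that RTL verification relies on.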
Determine or Create the top-level function
• Determine the top-level function for synthesis
• If there are multiple functions, they must be merged
– There can only be 1 top-level function for synthesis
main.c (original):
int main () {
  ...
  func_A(a,b,*i1);
  func_B(c,*i1,*i2);
  func_C(*i2,ret);
  return ret;
}
Given a case where functions func_A and func_B are to be implemented in FPGA
func_AB.c:
#include "func_AB.h"
func_AB(a,b,c, *i1, *i2) {
  ...
  func_A(a,b,*i1);
  func_B(c,*i1,*i2);
  …
}
main.c (re-partitioned):
#include "func_AB.h"
int main (a,b,c,d) {
  ...
  // func_A(a,b,i1);
  // func_B(c,i1,i2);
  func_AB (a,b,c, *i1, *i2);
  func_C(*i2,ret);
  return ret;
}
Re-partition the design to create a new single top-level function inside main()
Recommendation is to separate test bench and design files
Determine or Create Top Level Function
Vivado HLS Benefits
Productivity
– Verification
• Functional
• Architectural
– Abstraction
• Datatypes
• Interface
• Classes
– Automation
Block level specification AND verification significantly reduced
[Figure: RTL (Spec) to RTL (Sim) flow compared with C (Spec/Sim) to RTL (Sim) flow]
Portability
– Processors and FPGAs
– Technology migration
– Cost reduction
– Power reduction
Design and IP reuse
Permutability
– Architecture Exploration
• Timing: parallelization, pipelining
• Resources: sharing
– Better QoR
Rapid design exploration delivers QoR rivaling hand-coded RTL
Large Networking Company: Video Up-scaler
Total effort for AutoESL design   2 days
AutoESL runtime                   8 seconds
Slice registers                   1651
Slice LUTs                        1566
DSP48s                            34
Achieved throughput               720p -> 1080p at 60 fps, 150 MHz in Virtex 5
Fast design time
Efficient RTL
High-definition video
Comprehensive C Support
• A Complete C Validation & Verification Environment
– AutoESL supports complete bit-accurate validation of the C model
– AutoESL provides a productive C-RTL co-simulation verification solution
• AutoESL supports C, C++ and SystemC
– Functions can be written in any version of C
– Wide support for coding constructs in all three variants of C
• It’s easier to discuss what’s not supported than what is
• Modeling with bit-accuracy
– Supports arbitrary precision types for all input languages
– Allowing the exact bit-widths to be modeled and synthesized
• Floating point support
– Support for the use of float and double in the code
• Pointers and streaming-based applications
– Multi-access pointer issues and streams
C, C++ and SystemC Support
• The vast majority of C, C++ and SystemC is supported
– Provided it is statically defined at compile time
– If it’s not defined until run time, it won’t be synthesizable
• Any of the three variants of C can be used
– If C is used, AutoESL expects the file extension to be .c
– For C++ and SystemC it expects the file extension .cpp
Summary
• In High-Level Synthesis (HLS)
– C becomes RTL
– Operations in the code map to hardware resources
– Understand how constructs such as functions, loops and arrays are synthesized
• HLS design involves
– Synthesize the initial design
– Analyze to see what limits the performance
• User directives to change the default behaviors
• Remove bottlenecks
– Analyze to see what limits the area
• The types used define the size of operators
• This can have an impact on what operations can fit in a clock cycle
• Use directives to shape the initial design to meet performance
– Increase parallelism to improve performance
– Refine bit sizes and sharing to reduce area
Celoxica Handel-C
Handel-C
Programming language
- enables compilation of programs into synchronous hardware
NOT a Hardware Description Language
- it’s a programming language aimed at compiling high-level algorithms into gate-level hardware
Syntax (loosely) based on “C”
Handel-C is to hardware (gates) what “C” is to micro-assembly code
Handel-C: Advantages
- Hardware design produced is exactly the hardware specified in the source program
- Logic gates are the assembly instructions of the Handel-C system
- No intermediate “interpreting” layer as in assembly language targeting a general-purpose microprocessor
- Easy to learn!
- Design/re-design/optimize at software level!
Comparison with “C”
Similar:
- Programs inherently sequential
- Similar control-flow constructs: if-then-else, switch, while, for, etc.
Dissimilar:
- No malloc / dynamic store allocation
- No recursion (limited recursion in macros)
- No nested procedures
- No stdin/stdout
- “void main()”
- Variable width words
- PAR, etc.
Example 1 (sum)
void main() {
    unsigned int 16 sum;   // variable width word
    unsigned int 8 data;
    chanin input;          // input/output
    chanout output;

    sum = 0;
    do {
        input ? data;
        sum = sum + (0 @ data);
    } while (data != 0);
    output ! sum;
}
IMPORTANT – width!!
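The role of the declared widths in Example 1 can be checked with a small software model. The sketch below is plain Python, not Handel-C. The function name `sum_stream` is hypothetical, and the wrap-around of the 16-bit register (keeping only the low 16 bits on overflow) is an assumption made for illustration.

```python
SUM_WIDTH = 16   # width of 'sum' in Example 1
DATA_WIDTH = 8   # width of 'data' in Example 1

def sum_stream(values):
    """Model of Example 1: add 8-bit inputs into a 16-bit register until a 0 arrives."""
    total = 0
    for data in values:
        if not 0 <= data < (1 << DATA_WIDTH):
            raise ValueError("data does not fit in 8 bits")
        # (0 @ data) zero-extends data; the register is assumed to keep 16 bits only
        total = (total + data) & ((1 << SUM_WIDTH) - 1)
        if data == 0:   # mirrors: do { ... } while (data != 0)
            break
    return total
```

With these assumptions, a stream whose true total exceeds 65535 no longer returns that total, which is the practical reason the slide stresses width.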
Supported Declarations & Statements
• Main program structure
• Comments: /* */ and //
• Variables
• Constants
• Arrays
• Structures
• Conditional execution
– If statement
– Switch statement
• Arithmetic, relational, and logic operators
• Iteration
– For loop
– While loop
– Do … while loop
Handel-C describes Hardware!
• No side effects in expressions
– i.e. statements like a = b*c++; are not supported
• No floating point
– Floating point is not directly supported by Handel-C, but DK4/5 includes a library for fixed- and floating-point arithmetic
• No run-time recursion
– Due to the absence of any kind of 'call stack' in hardware
• Limited standard library (e.g. no printf, fopen, etc.)
– However, DK allows direct calls to external functions written in C/C++, and these could incorporate file I/O, user interaction, recursion, etc.
Declarations
• Handel-C uses two kinds of objects:
1. Logic types
2. Architecture types
• Logic types specify variables
– The basic logic type is int
• Architecture types specify variables that require a particular sort of hardware architecture
– ROMs
– RAMs
– Channels (I/O simulation)
– Interfaces (connect to board, i.e., buses)
Variables
• The range of an 8-bit signed integer is -128 to 127
– Signed integers use 2's complement representation
• The range of an 8-bit unsigned integer is 0 to 255 inclusive
• Predetermined widths available
– char (8), short (16), long (32), int32 (32), int64 (64)
• Handel-C provides support for porting from conventional C by allowing the types char, short and long
• Examples:
unsigned char w;    // 8 bits, unsigned
short y;            // 16 bits, signed
unsigned long z;    // 32 bits, unsigned
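The ranges quoted for 8-bit integers follow from the width alone, so the same arithmetic applies to any width. A minimal Python sketch (the helper name `int_range` is hypothetical, not part of Handel-C or DK):

```python
def int_range(width, signed):
    """Return (min, max) representable in 'width' bits."""
    if width < 1:
        raise ValueError("width must be at least 1")
    if signed:
        # two's complement: one more negative value than positive
        return -(1 << (width - 1)), (1 << (width - 1)) - 1
    return 0, (1 << width) - 1
```

For example, `int_range(8, True)` gives the signed 8-bit range and `int_range(8, False)` the unsigned one.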
Variables
• Handel-C has one basic type: integer
• May be signed or unsigned
• Can be any width, not limited to 8, 16, 32, etc.
• Variables are mapped to hardware registers.
void main(void)
{
    unsigned 6 a;
    a = 45;
}
a = 1 0 1 1 0 1 = 0x2d (bits shown from MSB to LSB)
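The bit pattern shown for a = 45 in a 6-bit register can be reproduced in software. This is a Python sketch only; `to_register` is a hypothetical helper, and rejecting values that do not fit (rather than truncating them) is an assumption of the model.

```python
def to_register(value, width):
    """Return the bit string (MSB first) an unsigned 'width'-bit register would hold."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} unsigned bits")
    return format(value, f"0{width}b")
```

`to_register(45, 6)` yields the six bits on the slide, and `int(to_register(45, 6), 2)` converts them back to 0x2d.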
Features & Statements (contd.)
• Variables
int 10 x, y, z;
int undefined a;    /* compiler will determine a suitable width for a */
a = x + y;
• Arrays (declarations same as conventional C)
– Index must be a compile-time constant
– Parallel access to array elements is allowed
– Implemented as a sequence of registers (expensive)
int 6 x[7];
x[4] = 1;
unsigned int 6 x[4][5][6];
A Simple Program
Assignments Hardware
Handel-C Timing
Sequential Execution
Handel-C: Parallelism
• Handel-C blocks are sequential by default
• par{…} executes statements in parallel
• A par block completes when all of its statements complete
– Time for the block is the time of the longest statement
• Sequential blocks can be nested in par blocks
Sequential block:
// 3 clock cycles
{
    a = 1;
    b = 2;
    c = 3;
}
Parallel block:
// 1 clock cycle
par
{
    a = 1;
    b = 2;
    c = 3;
}
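The timing rules above (a sequential block takes the sum of its statements, a par block takes the time of its longest branch) can be written as a short recursive model. This is a Python sketch, not Handel-C; the tuple encoding of blocks and the function name `cycles` are invented for illustration.

```python
def cycles(block):
    """Clock cycles taken by a nested block description.

    A block is either the string "assign" (one statement, one cycle),
    ("seq", [blocks...]) or ("par", [blocks...]).
    """
    if block == "assign":
        return 1
    kind, children = block
    child_cycles = [cycles(child) for child in children]
    if kind == "seq":
        return sum(child_cycles)              # statements run one after another
    if kind == "par":
        return max(child_cycles, default=0)   # block ends with its longest branch
    raise ValueError(f"unknown block kind: {kind!r}")
```

A par block containing a two-statement sequential block and a single assignment, `cycles(("par", [("seq", ["assign", "assign"]), "assign"]))`, takes two cycles under this model because the sequential branch is the longest.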
Timing
Additional Features & Statements
• Concurrency
...
par {
    {}
    …
    { … }
}
Par Completion: Soln
More Parallelism
• Example: array initialisation
• Sequential version takes 20 clock cycles
– The for() loop has 1 cycle of overhead per iteration for the increment
• Parallel version takes 1 clock cycle
– Replicated par() builds hardware to execute all 10 iterations in a single cycle
• Allows a trade-off between hardware size and performance
Sequential code:
for (i = 0; i < 10; i++)
{
    array[i] = 0;
}
Parallel code:
par (i = 0; i < 10; i++)
{
    array[i] = 0;
}
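The size-versus-speed trade-off for the array initialisation can be tabulated with two small functions. This is a Python sketch under the slide's own cost assumptions (one cycle for the loop body plus one for the increment); the "assignment_units" figure is a hypothetical stand-in for hardware size, not a real area estimate.

```python
def sequential_loop(n):
    """for loop: each iteration costs 1 cycle for the body and 1 for the increment."""
    return {"cycles": 2 * n, "assignment_units": 1}

def replicated_par(n):
    """Replicated par: n copies of the body, all executing in the same cycle."""
    return {"cycles": 1, "assignment_units": n}
```

For n = 10 the sequential form reports 20 cycles with one unit, and the replicated form reports 1 cycle with ten units.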
While Loops
Example: Conditional Operators
Arrays, RAMs and ROMs
• Handel-C easily allows designers to declare arrays of registers, ROMs and RAMs.
• An array of registers is declared like an array in C. All the registers may be accessed in parallel.
• This array can be turned into a ROM or RAM by putting the appropriate keyword in front. Only one location may then be accessed per clock cycle.
unsigned 8 Data[256];
ram unsigned 8 Data1[256];
rom unsigned 8 Data2[256];
// Array & RAM access example
{
    A = Data2[1];      // Read array, RAM or ROM
    Data1[11] = 3;     // Write to array or RAM
}
Additional Features & Statements
• Using external and internal RAM / ROM
• RAMs and ROMs may only have one entry accessed in any clock cycle
• More efficient to implement in terms of h/w resources than arrays, and they allow a non-constant index
• The Handel-C compiler can infer width, type and number of entries.
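The one-access-per-cycle restriction on RAMs and ROMs can be made concrete with a toy model. This is a Python sketch, not generated hardware; the class `SinglePortRam` and its `tick` method are invented for illustration, and raising an error on a second access is simply how this model flags a conflict.

```python
class SinglePortRam:
    """Toy model of a RAM that allows one access per clock cycle."""

    def __init__(self, entries):
        self.cells = [0] * entries
        self.accessed = False   # has the port already been used this cycle?

    def tick(self):
        """Advance to the next clock cycle and free the port."""
        self.accessed = False

    def _claim_port(self):
        if self.accessed:
            raise RuntimeError("second RAM access in the same clock cycle")
        self.accessed = True

    def read(self, address):
        self._claim_port()
        return self.cells[address]

    def write(self, address, value):
        self._claim_port()
        self.cells[address] = value
```

A write followed by `tick()` and then a read succeeds, while two reads without a `tick()` in between raise an error. An array of registers has no such limit, which is why it costs more hardware.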
RAM Access from Handel-C
Handel-C allows you access to a number of different types of RAM:
1) Distributed RAM, which is implemented in look-up tables in the logic blocks of FPGAs.
2) Block RAM, which is available on certain chips.
3) Off-chip RAM
(1) Distributed RAM
Internal RAM / ROM
ram unsigned int 8 myram[256];
rom unsigned int 8 program[] = {1, 2, 3, 4};
unsigned char i;
i = 3;
myram[i] = 25;
for (i = 0; i < 4; i++)
    stdout ! program[i];
(2) Block RAM
• Block RAM (single port)
ram unsigned 8 MyRam[512] with {block = 1};
• Block RAM (dual port)
mpram
{
    ram unsigned 8 ReadWriteA[512];
    ram unsigned 8 ReadWriteB[512];
}
MyRam with {block = 1};
RAM Access: Use Registers
• To minimize the logic for external, distributed and block RAM accesses, it is best to use registers directly for address and data
• Write: supply the data and address from a register directly (no expression)
MyRam[MyAddressReg] = MyDataReg;
• Read: supply the address directly from a register and read the data directly into a register
MyDataReg = MyRam[MyAddressReg];
(3) Off-chip RAM
External RAM / ROM
ram unsigned int 4 ExtRAM[8] with {offchip = 1,
    data = {"P01", "P02", "P03", "P04"},
    addr = {"P05", "P06", "P07"},
    we = {"P08"}, oe = {"P09"}, cs = {"P10"}
};
rom unsigned int 4 ExtROM[8] with {offchip = 1,
    data = {"P01", "P02", "P03", "P04"},
    addr = {"P05", "P06", "P07"},
    we = {}, oe = {"P09"}, cs = {"P10"}
};
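In the off-chip declarations, the pin lists line up with the memory shape: a 4-bit word has four data pins and 8 entries need three address pins. A small Python sketch of that arithmetic follows; the helper names `address_pins` and `check_offchip` are hypothetical and are not part of any Handel-C tool.

```python
def address_pins(entries):
    """Number of address lines needed to select one of 'entries' locations."""
    if entries < 1:
        raise ValueError("entries must be at least 1")
    return max(1, (entries - 1).bit_length())

def check_offchip(entries, data_width, data_pins, addr_pins):
    """True if the pin lists match the declared depth and word width."""
    return (len(data_pins) == data_width
            and len(addr_pins) == address_pins(entries))
```

Called with the ExtRAM values (8 entries, width 4, the four data pins and three address pins listed), `check_offchip` returns True.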
Synthesizable ANSI-C for hardware
Porting "C" to Handel-C
1. Decide how software maps to hardware platform
2. Partition algorithm between multiple FPGAs
3. Port C to Handel-C & use simulator to check correctness
4. Modify code to take advantage of extra operators in Handel-C; simulate to ensure correctness
5. Add fine-grain parallelism through PAR & parallel assignments, or parallelize algorithm; simulate
6. Add hardware interfaces for target architecture & map simulator channel communications onto these interfaces; simulate
7. Use FPGA place & route tools to generate FPGA images
Summary: Handel-C
• C-based programming language for digital system design.
• One clock cycle per statement.
• Explicit parallelism.
• Compiler generates hardware design from Handel-C source.
• Additions:
– support for parallelism (PAR statement)
– channels for communications between parallel processes
– operators for detailed control of hardware
– constructs for RAM, ROM, interfacing, etc.