ENG3050/ENG6530 Reconfigurable Computing Systems
Review & Final Exam
ENG6530 RCS - Winter 2014
Course Objectives
This course aims to:
1. Give an overview of the traditional von Neumann computer architecture: its specification, design, implementations, main drawbacks, and techniques to improve its performance.
2. Teach the internal structure of programmable logic in general and Field Programmable Gate Arrays in particular.
3. Teach how digital circuits are designed today using advanced CAD tools, HDLs, and high-level languages.
4. Teach the basic concepts of Reconfigurable Computing systems (hardware/software co-design).
5. Teach when and how to apply Reconfigurable Computing concepts to design efficient, reliable, robust systems (e.g., DSP).
6. Develop an understanding of Run-Time Reconfiguration.
Reconfigurable Computing: Definition
Reconfigurable Computing (RC) is a computing paradigm in which programmable logic devices are used to accelerate computations or applications by exploiting parallelism at different levels (bit, instruction, architectural), and in which algorithms are implemented as a temporally and spatially ordered set of very complex tasks.
What is meant by temporal and spatial implementations?
Spatial vs. Temporal Computing
Ax^2 + Bx + C          (Ax + B)x + C
Spatial (ASIC)         Temporal (Processor)
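The two columns above can be mimicked in software (a sketch with made-up coefficients; in real hardware the "spatial" version would be a combinational circuit with one unit per operator):

```python
# Evaluating Ax^2 + Bx + C two ways (illustrative values, not from the slides).
A, B, C, x = 3, 5, 7, 2

# "Spatial" view (ASIC-like): every operator exists as its own hardware unit,
# so the whole expression is computed in one combinational pass.
spatial = A * x * x + B * x + C

# "Temporal" view (processor-like): Horner's form (Ax + B)x + C reuses one
# multiply-add unit across sequential steps.
acc = A
acc = acc * x + B   # step 1 on the shared multiply-add unit
acc = acc * x + C   # step 2 on the same unit
temporal = acc

assert spatial == temporal  # 3*4 + 5*2 + 7 = 29 either way
```

The spatial form trades area for latency; the temporal form trades latency for area, which is exactly the ASIC-versus-processor trade-off the slide illustrates.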
Issues in Configurable Design
1. Reconfigurable hardware architecture (FPGA): choice and granularity of computational elements; issues related to performance, area, and power consumption.
2. Design entry techniques: low level (VHDL) vs. high level (ESL, e.g., Handel-C).
3. Support of efficient CAD tools: high-level synthesis, logic optimization, mapping, place & route.
4. Coupling approaches: tightly coupled vs. loosely coupled.
5. Area versus performance: serial, semi-parallel, parallel; floating point vs. fixed point.
6. Reconfiguration time and rate: static versus dynamic reconfiguration (area, performance, approaches).
Methods for executing algorithms
Hardware (Application-Specific Integrated Circuits)
  Advantages: very high performance and efficiency.
  Disadvantages: not flexible (can't be altered after fabrication); expensive.
Software-programmed processors
  Advantages: software is very flexible to change.
  Disadvantages: performance can suffer if the clock is not fast; the instruction set is fixed by the hardware.
Reconfigurable computing
  Advantages: fills the gap between hardware and software; much higher performance than software; higher level of flexibility than hardware.
The Von Neumann Computer
Advantage
  Flexibility: any well-coded program can be executed.
Drawbacks
  Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing).
  Resource efficiency: only one part of the hardware resources is required for the execution of an instruction; the rest remains idle.
  Memory access: memories are about 10 times slower than the processor.
How to compensate for these deficiencies?
Improving Performance of VN (GPPs)
1. Technology scaling: improves performance (increases clock frequency!)
2. Improving the processor's instruction set
3. Application-specific processors (DSP)
4. Hierarchical memory system: caches can enhance speed
5. Multiple functional units (H/W): adders/multipliers/dividers (CDC-6600)
6. Pipelining within the CPU (H/W): e.g., a four-stage pipeline (IF/ID/OF/EX)
7. Overlapping CPU & I/O operations (H/W): DMA (Direct Memory Access) can be used to enhance performance
8. Time sharing (S/W): multi-tasking assigns fixed or variable time slices to multiple programs
9. Parallelism & multithreading (S/W, H/W): compilers / multi-core systems
Exploiting Parallelism
• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining
  – Superscalar
  – Limits to benefits of ILP?
• Process-level or thread-level parallelism; mainstream for general-purpose computing?
  – Servers are parallel
  – High-end desktop dual-processor PCs
Pipelining
Pipelining exploits parallelism at the instruction level: it is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. It is used not only in general-purpose processors but also in hardware accelerators.
Speed Up
If the stages are perfectly balanced, then the time between instructions on the pipelined processor – assuming ideal conditions – is equal to:

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages, i.e., a five-stage pipeline is nearly five times faster.
Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction.
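The ideal-pipeline relation above can be checked numerically. This is a sketch of the textbook model only; it ignores hazards, stalls, and unbalanced stages:

```python
# Ideal pipeline model: after the pipe fills (n_stages cycles for the first
# instruction), one instruction completes every stage_time.
def pipelined_time(n_instructions, n_stages, stage_time):
    return (n_stages + (n_instructions - 1)) * stage_time

# Without pipelining, each instruction takes all stages back to back.
def nonpipelined_time(n_instructions, n_stages, stage_time):
    return n_instructions * n_stages * stage_time

n, stages, t = 1000, 5, 1.0
speedup = nonpipelined_time(n, stages, t) / pipelined_time(n, stages, t)
print(round(speedup, 2))  # 4.98: approaches the stage count (5) for large n
```

This reproduces the slide's claim: for a large instruction count, speedup tends toward the number of pipe stages.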
Parallel Processing
Using more than one processor to solve a problem: the idea is that n processors operating simultaneously can achieve the result n times faster.
Motives
  Diminishing returns from ILP: limited ILP in programs, and ILP increasingly expensive to exploit
  Fault tolerance
  Large amount of memory available
Flynn's Taxonomy
Classifies architectures by instruction stream (single, SI / multiple, MI) and data stream (single, SD / multiple, MD):
  SISD – single-threaded process
  MISD – pipeline architecture
  SIMD – vector processing
  MIMD – multi-threaded programming
Speedup factor
S(n) = (Execution time on a single processor) / (Execution time on a multiprocessor with n processors)
S(n) gives the increase in speed obtained by using a multiprocessor.
The speedup factor can also be cast in terms of computational steps:
S(n) = (Number of steps using one processor) / (Number of parallel steps using n processors)
The maximum speedup is n with n processors (linear speedup) - this theoretical limit is not always achieved. WHY?
Amdahl's Law
• Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using n processors.
• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.
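Amdahl's law is commonly written S(n) = 1 / ((1 - f) + f/n), where f is the parallelizable fraction. A quick numeric check (the fraction and processor counts below are illustrative values, not from the slides):

```python
# Amdahl's law: the serial fraction (1 - f) bounds the achievable speedup.
def amdahl_speedup(f, n):
    """f: parallelizable fraction of the program, n: processor count."""
    return 1.0 / ((1.0 - f) + f / n)

# A 95%-parallel program: even with many processors the speedup is capped.
print(round(amdahl_speedup(0.95, 4), 2))     # 3.48 with 4 processors
print(round(amdahl_speedup(0.95, 1000), 2))  # 19.63, nearing the 1/0.05 = 20 cap
```

This makes the slide's point concrete: the sequential 5% alone limits the program to at most a 20x speedup, no matter how many processors are added.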
Benefits of Reconfigurable Systems
A trade-off between traditional hardware (performance) and software (flexibility): hardware-like performance with software-like flexibility.
Hardware can be modified on-the-fly: it can accommodate any architecture (resource management), and changes can be made in the field.
Orders-of-magnitude performance improvements over traditional software systems.
Programming can be achieved at different levels of abstraction: HDL (VHDL/Verilog), C/C++/Matlab.
Remember!
[Figure: fine-grain FPGA fabric, consisting of programmable lookup tables (LUTs) and a programmable routing structure.]
The main bottleneck with state-of-the-art fine-grain FPGAs is the routing, enabled by pass transistors!
Remember!
[Figure: a LUT in detail, with SRAM configuration bits selected by the inputs x, y, z to produce the output f.]
Look-up tables are flexible but require lots of configuration and suffer from power dissipation!
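A LUT's behavior is easy to model: the k inputs simply address one of 2^k SRAM configuration bits. A minimal sketch of a 3-LUT (the class name and bit ordering are illustrative choices, not from the slides):

```python
# A 3-input LUT is an 8-entry truth table stored in SRAM; the inputs form
# the address that selects one configuration bit.
class LUT3:
    def __init__(self, truth_bits):
        assert len(truth_bits) == 8   # 2^3 configuration bits
        self.bits = truth_bits

    def eval(self, x, y, z):
        index = (x << 2) | (y << 1) | z   # inputs address the SRAM
        return self.bits[index]

# Configuration bits for f(x, y, z) = x AND (y OR z), index order xyz = 000..111
f = LUT3([0, 0, 0, 0, 0, 1, 1, 1])
assert f.eval(1, 0, 1) == 1
assert f.eval(0, 1, 1) == 0
```

Reprogramming the same LUT to any other 3-input function is just a matter of loading different configuration bits, which is exactly the flexibility (and the configuration cost) the slide refers to.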
Evolution of the FPGA
Early FPGAs were used mainly for "glue logic" between other components: simple CLBs with small numbers of inputs; the focus was on implementing "random" logic efficiently.
As capacities grew, other applications emerged: FPGAs as an alternative to custom ICs for entire applications; emulation of ASICs; computing with FPGAs.
FPGAs have changed to meet new application demands:
  Carry chains and better support for multi-bit operations
  Integrated memories, such as the Block RAMs
  Specialized units, such as multipliers, to implement functions that are slow/inefficient in CLBs
  Clock managers to control the frequency of operation
  Newer devices incorporate entire CPUs: the Xilinx Virtex-II Pro has 1 to 4 PowerPC CPUs
CGRAs
CGRAs try to overcome the disadvantage of FPGA-based computing solutions by providing multiple-bit-wide datapaths and complex operators instead of bit-level configurability.
  The wide datapath allows the efficient implementation of complex operators in silicon.
  The routing overhead generated by having to compose complex operators from bit-level processing units is avoided.
  Fewer processing elements (LUTs vs. PEs) means less time to configure and reconfigure the devices.
CAD for FPGAs
Design Entry → Logic Optimization → Synthesis → Mapping to k-LUTs → Packing LUTs into CLBs → Placement → Routing → Configure the FPGA (with simulation alongside the flow)
Quine-McCluskey Method: Details
1. Generate all prime implicants.
2. Construct the prime implicant table.
3. Reduce the prime implicant table:
   I. Check for essential implicants (rows) and remove them;
   II. Check for row dominance and remove all dominated rows;
   III. Check for column dominance and remove all dominating columns.
4. Repeat I, II, III as long as any removal occurs.
5. Final solution:
   – If no rows/columns are left, an optimal solution has been found;
   – Otherwise, the instance is called cyclic:
     • Use a heuristic
     • Use a branch & bound technique
     • Use Petrick's method
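Step 3.I of the procedure (finding essential rows) can be sketched as follows; the prime implicant chart used here is a hypothetical example, not one from the course:

```python
# An implicant (row) is essential when it is the ONLY row covering
# some minterm (column) of the prime implicant table.
def essential_rows(chart):
    # chart: prime implicant -> set of minterms it covers
    all_minterms = set().union(*chart.values())
    essentials = set()
    for m in all_minterms:
        covering = [pi for pi, ms in chart.items() if m in ms]
        if len(covering) == 1:          # column covered by a single row
            essentials.add(covering[0])
    return essentials

# Hypothetical chart: minterm 0 is covered only by A'B', minterm 7 only by AC.
chart = {"A'B'": {0, 1}, "B'C": {1, 5}, "AC": {5, 7}}
assert essential_rows(chart) == {"A'B'", "AC"}
```

After the essential rows are removed (along with the columns they cover), the dominance checks of steps II and III are applied to what remains of the table.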
FPGA Placement Problem
• Input – a technology-mapped netlist of Configurable Logic Blocks (CLBs) realizing a given circuit.
• Output – the CLB netlist placed in a two-dimensional array of slots such that total wirelength is minimized.
[Figure: a CLB netlist (blocks 1–10, inputs i1–i4, outputs f1 and f2) placed onto the FPGA's two-dimensional slot array.]
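A common way to score a placement like the one above is half-perimeter wirelength (HPWL), a standard estimate used by placement tools: each net contributes the half-perimeter of the bounding box of its pins. The netlist and coordinates below are made up for illustration:

```python
# Half-perimeter wirelength (HPWL): for each net, sum the width and height
# of the smallest box enclosing all blocks on that net.
def hpwl(nets, positions):
    # nets: list of lists of block names; positions: block -> (x, y) slot
    total = 0
    for net in nets:
        xs = [positions[b][0] for b in net]
        ys = [positions[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Hypothetical 3-block placement on a 2-D slot array
positions = {"clb1": (0, 0), "clb2": (2, 1), "clb3": (1, 3)}
nets = [["clb1", "clb2"], ["clb2", "clb3"]]
print(hpwl(nets, positions))  # (2+1) + (1+2) = 6
```

A placer (e.g., simulated annealing) would repeatedly swap block positions and keep moves that reduce this cost.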
Global vs. Detailed Routing
Global routing assigns each net to a sequence of routing channels; detailed routing then selects the specific wires and switch connections within those channels.
[Figure: island-style FPGA fabric of logic blocks (LB) and switch boxes (SB), showing a global route and the corresponding detailed route.]
VHDL for Synthesis
VHDL for simulation vs. VHDL for synthesis:
  Only a subset of the VHDL language is synthesizable, and the synthesizable subset is tool-specific!
  – Do not expect your VHDL description to be synthesizable with another tool.
  Understand what is meant for simulation versus what is meant for producing hardware (i.e., synthesis).
VHDL Hardware Correspondence
Abstraction: Advantages
FPGA Tool Flow with ESL
C/C++, Java, etc. → High-level Synthesis → HDL → RT Synthesis → Netlist → Physical Design (Technology Mapping → Placement → Routing) → Bitfile → FPGA, with the software portion running on the Processor.
High-Level Synthesis (HLS)
• High-level synthesis creates an RTL implementation from C-level source code:
  – it extracts control and dataflow from the source code;
  – it implements the design based on defaults and user-applied directives.
• Many implementations are possible from the same source description: smaller designs, faster designs, optimal designs.
  – This enables manual design exploration.
AutoESL or Vivado HLS
AutoESL takes C/C++/SystemC source, a test bench, and constraints/directives as input, and generates RTL (VHDL/Verilog/SystemC) together with an RTL wrapper and a script with constraints, which then feed RTL simulation and RTL synthesis.
Contrasting ESLs
ESLs span a spectrum from hardware-oriented to software-oriented:
  Hardware side (HDL-like): Handel-C, VIVA, Mitrion-C – explicit parallel statements, memory statements, channels, …
  Software side: Impulse-C, SystemC, Vivado HLS – pure C/C++ statements with pragmas inserted.
Different levels of coupling
[Figure: a reconfigurable unit can be coupled to the host workstation at several levels: as a functional unit (FU) inside the CPU, as a coprocessor, as an attached processing unit on the memory/cache side, or as a standalone processing unit behind the I/O interface.]
Tightly coupled vs. loosely coupled
HW/SW Co-design
A profiler identifies the critical regions of a software application; those regions are mapped to hardware (ASIC/FPGA) while the remaining software runs on the processor.
Benefits
  Speedups of 2x to 10x typical
  Far more potential than dynamic SW optimizations (1.2x)
  Energy reductions of 25% to 95% typical
[Figure: time and energy bars comparing SW-only execution with HW/SW partitioned execution.]
Processor Specialization (ASIP)
Gains!
Exploiting Parallelism
Reconfigurable Instruction Set Processors
  Duplicated instruction-decode logic (2 symmetrical data channels)
  Duplicated commonly used functional units (ALU and shifter)
  All other functional units are shared (DSP operations, memory handler)
  A tightly coupled, pipelined, configurable gate array
XPRES Compiler
Reconfigurability
Reconfiguration is either static (execution is interrupted), semi-static (also called time-shared), or dynamic (in parallel with execution):
  Static configuration involves hardware changes at the slow rate of days/weeks.
  Semi-static (global RTR): if an application can be pipelined, it might be possible to implement each phase in sequence on the reconfigurable hardware (temporal partitioning techniques).
  Dynamic reconfiguration (local RTR): the hardware reconfigures itself on the fly as it executes a task, refining its own programming for improved performance.
Partial Reconfiguration Technology and Benefits
Partial Reconfiguration enables:
  System flexibility: perform more functions while maintaining communication links
  Size and cost reduction: time-multiplex the hardware to require a smaller FPGA
  Power reduction: shut down power-hungry tasks when not needed
  Fault-tolerant/self-repairing and adaptive systems
Power Reduction Techniques with PR
Many techniques can be employed to reduce power:
  Swap out high-power functions for low-power functions when maximum performance is not required
  Swap out high-power I/O standards for lower-power I/O when specific characteristics are not needed
  Swap out black boxes for inactive regions
  Time-multiplexing functions reduces power by reducing the amount of configured logic
[Figure: partial-reconfiguration system with a controller (MicroBlaze), the ICAP, and a flash controller.]
Current PR Design Flow
• Steps
  – Partition the system into modules
  – Define static modules and reconfigurable modules
  – Decide the number of PR regions (PRRs)
  – Decide PRR sizes, shapes, and locations
  – Map modules to PRRs
  – Define PRR interfaces; instantiate slice macros for the PRR interfaces
• Optimization problems
  – Design partitioning
  – Number of PRRs
  – PRR sizes, shapes, and locations
  – Mapping PRMs to PRRs
  – Type and placement of PRR interfaces
[Figure: design partitioning and floorplanning/budgeting: the system is split into static modules and reconfigurable modules (PRMs A, B, and C), which are mapped onto PRR 1 and PRR 2 alongside the FPGA's static region.]
Reconfigurable Operating Systems
Modern FPGAs can be partially and dynamically reconfigured. This feature adds tremendous flexibility to the Reconfigurable Computing (RC) field but also introduces challenges.
Reconfigurable operating systems aim to:
  Ease application development (higher level of abstraction)
  Ease application verification and maintenance
Design Space Exploration
What? Rapid exploration of various architectural solutions to be implemented on reconfigurable architectures, in order to select the most efficient architecture for one or several applications.
When? Takes place before architectural synthesis (algorithmic specification with a high-level abstraction language).
How? Estimations are based on a functional architecture model (generic, technology-independent), using an iterative exploration flow to progressively refine the architecture definition from a coarse model to a dedicated model.
Framework Overview
[Figure: the designer models the user application and constraints with a modeling tool; mathematical and meta-heuristic methods search the design space over a core library (embedded processors, dedicated hardware accelerators, and communication sub-systems such as point-to-point links and common buses); generated architectures are evaluated, and the proposed architecture is implemented on an ASIC, FPGA, or CGRA platform.]
Final Exam
Undergraduate final exam (50%); graduate final exam (40%)
  Comprehensive (includes most topics)
  Short questions that rely on understanding the topics
  2 hours long, 6-8 questions
  Only a few formulas are introduced in the course (memorize them)