ENG3050/ENG6530 Reconfigurable Computing Systems
Review & Final Exam
ENG6530 RCS - Winter 2014
Course Objectives
This course aims to:
1. Give an overview of the traditional von Neumann computer architecture: its specification, design, implementations, main drawbacks, and techniques to improve its performance.
2. Teach the internal structure of programmable logic in general and Field Programmable Gate Arrays in particular.
3. Teach how digital circuits are designed today using advanced CAD tools, HDLs, and high-level languages.
4. Teach the basic concepts of Reconfigurable Computing systems (hardware/software co-design).
5. Teach when and how to apply Reconfigurable Computing concepts to design efficient, reliable, robust systems (e.g., DSP).
6. Develop an understanding of Run-Time Reconfiguration.
Reconfigurable Computing: Definition
Reconfigurable Computing (RC) is a computing paradigm in which programmable logic devices are used to accelerate computations or applications by exploiting parallelism at different levels (bit, instruction, architectural), and in which algorithms are implemented as a temporally and spatially ordered set of very complex tasks.
What is meant by temporal and spatial implementations?
Spatial vs. Temporal Computing
Ax^2 + Bx + C          (Ax + B)x + C
Spatial (ASIC)         Temporal (Processor)
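The two columns above can be mimicked in software (a sketch with made-up coefficients; in real hardware the "spatial" version would be a combinational circuit with one unit per operator):

```python
# Evaluating Ax^2 + Bx + C two ways (illustrative values, not from the slides).
A, B, C, x = 3, 5, 7, 2

# "Spatial" view (ASIC-like): every operator exists as its own hardware unit,
# so the whole expression is computed in one combinational pass.
spatial = A * x * x + B * x + C

# "Temporal" view (processor-like): Horner's form (Ax + B)x + C reuses one
# multiply-add unit across sequential steps.
acc = A
acc = acc * x + B   # step 1 on the shared multiply-add unit
acc = acc * x + C   # step 2 on the same unit
temporal = acc

assert spatial == temporal  # 3*4 + 5*2 + 7 = 29 either way
```

The spatial form trades area for latency; the temporal form trades latency for area, which is exactly the ASIC-versus-processor trade-off the slide illustrates.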
Issues in Configurable Design
1. Reconfigurable hardware architecture (FPGA): choice and granularity of computational elements; issues related to performance, area, and power consumption.
2. Design entry techniques: low level (VHDL) vs. high level (ESL, e.g., Handel-C).
3. Support of efficient CAD tools: high-level synthesis, logic optimization, mapping, place & route.
4. Coupling approaches: tightly coupled vs. loosely coupled.
5. Area versus performance: serial, semi-parallel, parallel; floating point vs. fixed point.
6. Reconfiguration time and rate: static versus dynamic reconfiguration (area, performance, approaches).
Methods for executing algorithms
Hardware (Application-Specific Integrated Circuits)
  Advantages: very high performance and efficiency.
  Disadvantages: not flexible (can't be altered after fabrication); expensive.
Software-programmed processors
  Advantages: software is very flexible to change.
  Disadvantages: performance can suffer if the clock is not fast; the instruction set is fixed by the hardware.
Reconfigurable computing
  Advantages: fills the gap between hardware and software; much higher performance than software; higher level of flexibility than hardware.
The Von Neumann Computer
Advantage
  Flexibility: any well-coded program can be executed.
Drawbacks
  Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing).
  Resource efficiency: only one part of the hardware resources is required for the execution of an instruction; the rest remains idle.
  Memory access: memories are about 10 times slower than the processor.
How to compensate for these deficiencies?
Improving Performance of VN (GPPs)
1. Technology scaling: improves performance (increases clock frequency!)
2. Improving the processor's instruction set
3. Application-specific processors (DSP)
4. Hierarchical memory system: caches can enhance speed
5. Multiple functional units (H/W): adders/multipliers/dividers (CDC-6600)
6. Pipelining within the CPU (H/W): e.g., a four-stage pipeline (IF/ID/OF/EX)
7. Overlapping CPU & I/O operations (H/W): DMA (Direct Memory Access) can be used to enhance performance
8. Time sharing (S/W): multi-tasking assigns fixed or variable time slices to multiple programs
9. Parallelism & multithreading (S/W, H/W): compilers / multi-core systems
Exploiting Parallelism
• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining
  – Superscalar
  – Limits to benefits of ILP?
• Process-level or thread-level parallelism; mainstream for general-purpose computing?
  – Servers are parallel
  – High-end desktop dual-processor PCs
Pipelining
Pipelining exploits parallelism at the instruction level: it is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. It is used not only in general-purpose processors but also in hardware accelerators.
Speed Up
If the stages are perfectly balanced, then the time between instructions on the pipelined processor – assuming ideal conditions – is equal to:

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages, i.e., a five-stage pipeline is nearly five times faster.
Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction.
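The ideal-pipeline relation above can be checked numerically. This is a sketch of the textbook model only; it ignores hazards, stalls, and unbalanced stages:

```python
# Ideal pipeline model: after the pipe fills (n_stages cycles for the first
# instruction), one instruction completes every stage_time.
def pipelined_time(n_instructions, n_stages, stage_time):
    return (n_stages + (n_instructions - 1)) * stage_time

# Without pipelining, each instruction takes all stages back to back.
def nonpipelined_time(n_instructions, n_stages, stage_time):
    return n_instructions * n_stages * stage_time

n, stages, t = 1000, 5, 1.0
speedup = nonpipelined_time(n, stages, t) / pipelined_time(n, stages, t)
print(round(speedup, 2))  # 4.98: approaches the stage count (5) for large n
```

This reproduces the slide's claim: for a large instruction count, speedup tends toward the number of pipe stages.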
Parallel Processing
Using more than one processor to solve a problem: the idea is that n processors operating simultaneously can achieve the result n times faster.
Motives
  Diminishing returns from ILP: limited ILP in programs, and ILP increasingly expensive to exploit
  Fault tolerance
  Large amount of memory available
Flynn's Taxonomy
Classifies architectures by instruction stream (single, SI / multiple, MI) and data stream (single, SD / multiple, MD):
  SISD – single-threaded process
  MISD – pipeline architecture
  SIMD – vector processing
  MIMD – multi-threaded programming
Speedup factor
S(n) = (Execution time on a single processor) / (Execution time on a multiprocessor with n processors)
S(n) gives the increase in speed obtained by using a multiprocessor.
The speedup factor can also be cast in terms of computational steps:
S(n) = (Number of steps using one processor) / (Number of parallel steps using n processors)
The maximum speedup is n with n processors (linear speedup) - this theoretical limit is not always achieved. WHY?
Amdahl's Law
• Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using n processors.
• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.
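Amdahl's law is commonly written S(n) = 1 / ((1 - f) + f/n), where f is the parallelizable fraction. A quick numeric check (the fraction and processor counts below are illustrative values, not from the slides):

```python
# Amdahl's law: the serial fraction (1 - f) bounds the achievable speedup.
def amdahl_speedup(f, n):
    """f: parallelizable fraction of the program, n: processor count."""
    return 1.0 / ((1.0 - f) + f / n)

# A 95%-parallel program: even with many processors the speedup is capped.
print(round(amdahl_speedup(0.95, 4), 2))     # 3.48 with 4 processors
print(round(amdahl_speedup(0.95, 1000), 2))  # 19.63, nearing the 1/0.05 = 20 cap
```

This makes the slide's point concrete: the sequential 5% alone limits the program to at most a 20x speedup, no matter how many processors are added.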
Benefits of Reconfigurable Systems
A trade-off between traditional hardware (performance) and software (flexibility): hardware-like performance with software-like flexibility.
Hardware can be modified on-the-fly: it can accommodate any architecture (resource management), and changes can be made in the field.
Orders-of-magnitude performance improvements over traditional software systems.
Programming can be achieved at different levels of abstraction: HDL (VHDL/Verilog), C/C++/Matlab.
Remember!
[Figure: fine-grain FPGA fabric, consisting of programmable lookup tables (LUTs) and a programmable routing structure.]
The main bottleneck with state-of-the-art fine-grain FPGAs is the routing, enabled by pass transistors!
Remember!
[Figure: a LUT in detail, with SRAM configuration bits selected by the inputs x, y, z to produce the output f.]
Look-up tables are flexible but require lots of configuration and suffer from power dissipation!
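A LUT's behavior is easy to model: the k inputs simply address one of 2^k SRAM configuration bits. A minimal sketch of a 3-LUT (the class name and bit ordering are illustrative choices, not from the slides):

```python
# A 3-input LUT is an 8-entry truth table stored in SRAM; the inputs form
# the address that selects one configuration bit.
class LUT3:
    def __init__(self, truth_bits):
        assert len(truth_bits) == 8   # 2^3 configuration bits
        self.bits = truth_bits

    def eval(self, x, y, z):
        index = (x << 2) | (y << 1) | z   # inputs address the SRAM
        return self.bits[index]

# Configuration bits for f(x, y, z) = x AND (y OR z), index order xyz = 000..111
f = LUT3([0, 0, 0, 0, 0, 1, 1, 1])
assert f.eval(1, 0, 1) == 1
assert f.eval(0, 1, 1) == 0
```

Reprogramming the same LUT to any other 3-input function is just a matter of loading different configuration bits, which is exactly the flexibility (and the configuration cost) the slide refers to.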
Evolution of the FPGA
Early FPGAs were used mainly for "glue logic" between other components: simple CLBs with small numbers of inputs; the focus was on implementing "random" logic efficiently.
As capacities grew, other applications emerged: FPGAs as an alternative to custom ICs for entire applications; emulation of ASICs; computing with FPGAs.
FPGAs have changed to meet new application demands:
  Carry chains and better support for multi-bit operations
  Integrated memories, such as the Block RAMs
  Specialized units, such as multipliers, to implement functions that are slow/inefficient in CLBs
  Clock managers to control the frequency of operation
  Newer devices incorporate entire CPUs: the Xilinx Virtex-II Pro has 1 to 4 PowerPC CPUs
CGRAs
CGRAs try to overcome the disadvantage of FPGA-based computing solutions by providing multiple-bit-wide datapaths and complex operators instead of bit-level configurability.
  The wide datapath allows the efficient implementation of complex operators in silicon.
  The routing overhead generated by having to compose complex operators from bit-level processing units is avoided.
  Fewer processing elements (LUTs vs. PEs) means less time to configure and reconfigure the devices.
CAD for FPGAs
Design Entry → Logic Optimization → Synthesis → Mapping to k-LUTs → Packing LUTs into CLBs → Placement → Routing → Configure the FPGA (with simulation alongside the flow)
Quine-McCluskey Method: Details
1. Generate all prime implicants.
2. Construct the prime implicant table.
3. Reduce the prime implicant table:
   I. Check for essential implicants (rows) and remove them;
   II. Check for row dominance and remove all dominated rows;
   III. Check for column dominance and remove all dominating columns.
4. Repeat I, II, III as long as any removal occurs.
5. Final solution:
   – If no rows/columns are left, an optimal solution has been found;
   – Otherwise, the instance is called cyclic:
     • Use a heuristic
     • Use a branch & bound technique
     • Use Petrick's method
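Step 3.I of the procedure (finding essential rows) can be sketched as follows; the prime implicant chart used here is a hypothetical example, not one from the course:

```python
# An implicant (row) is essential when it is the ONLY row covering
# some minterm (column) of the prime implicant table.
def essential_rows(chart):
    # chart: prime implicant -> set of minterms it covers
    all_minterms = set().union(*chart.values())
    essentials = set()
    for m in all_minterms:
        covering = [pi for pi, ms in chart.items() if m in ms]
        if len(covering) == 1:          # column covered by a single row
            essentials.add(covering[0])
    return essentials

# Hypothetical chart: minterm 0 is covered only by A'B', minterm 7 only by AC.
chart = {"A'B'": {0, 1}, "B'C": {1, 5}, "AC": {5, 7}}
assert essential_rows(chart) == {"A'B'", "AC"}
```

After the essential rows are removed (along with the columns they cover), the dominance checks of steps II and III are applied to what remains of the table.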
FPGA Placement Problem
• Input – a technology-mapped netlist of Configurable Logic Blocks (CLBs) realizing a given circuit.
• Output – the CLB netlist placed in a two-dimensional array of slots such that total wirelength is minimized.
[Figure: a CLB netlist (blocks 1–10, inputs i1–i4, outputs f1 and f2) placed onto the FPGA's two-dimensional slot array.]
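A common way to score a placement like the one above is half-perimeter wirelength (HPWL), a standard estimate used by placement tools: each net contributes the half-perimeter of the bounding box of its pins. The netlist and coordinates below are made up for illustration:

```python
# Half-perimeter wirelength (HPWL): for each net, sum the width and height
# of the smallest box enclosing all blocks on that net.
def hpwl(nets, positions):
    # nets: list of lists of block names; positions: block -> (x, y) slot
    total = 0
    for net in nets:
        xs = [positions[b][0] for b in net]
        ys = [positions[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Hypothetical 3-block placement on a 2-D slot array
positions = {"clb1": (0, 0), "clb2": (2, 1), "clb3": (1, 3)}
nets = [["clb1", "clb2"], ["clb2", "clb3"]]
print(hpwl(nets, positions))  # (2+1) + (1+2) = 6
```

A placer (e.g., simulated annealing) would repeatedly swap block positions and keep moves that reduce this cost.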
Global vs. Detailed Routing
Global routing assigns each net to a sequence of routing channels; detailed routing then selects the specific wires and switch connections within those channels.
[Figure: island-style FPGA fabric of logic blocks (LB) and switch boxes (SB), showing a global route and the corresponding detailed route.]
VHDL for Synthesis
VHDL for simulation vs. VHDL for synthesis:
  Only a subset of the VHDL language is synthesizable, and the synthesizable subset is tool-specific!
  – Do not expect your VHDL description to be synthesizable with another tool.
  Understand what is meant for simulation versus what is meant for producing hardware (i.e., synthesis).
VHDL Hardware Correspondence
Abstraction: Advantages
FPGA Tool Flow with ESL
C/C++, Java, etc. → High-level Synthesis → HDL → RT Synthesis → Netlist → Physical Design (Technology Mapping → Placement → Routing) → Bitfile → FPGA, with the software portion running on the Processor.
High-Level Synthesis (HLS)
• High-level synthesis creates an RTL implementation from C-level source code:
  – it extracts control and dataflow from the source code;
  – it implements the design based on defaults and user-applied directives.
• Many implementations are possible from the same source description: smaller designs, faster designs, optimal designs.
  – This enables manual design exploration.
AutoESL or Vivado HLS
AutoESL takes C/C++/SystemC source, a test bench, and constraints/directives as input, and generates RTL (VHDL/Verilog/SystemC) together with an RTL wrapper and a script with constraints, which then feed RTL simulation and RTL synthesis.
Contrasting ESLs
ESLs span a spectrum from hardware-oriented to software-oriented:
  Hardware side (HDL-like): Handel-C, VIVA, Mitrion-C – explicit parallel statements, memory statements, channels, …
  Software side: Impulse-C, SystemC, Vivado HLS – pure C/C++ statements with pragmas inserted.
Different levels of coupling
[Figure: a reconfigurable unit can be coupled to the host workstation at several levels: as a functional unit (FU) inside the CPU, as a coprocessor, as an attached processing unit on the memory/cache side, or as a standalone processing unit behind the I/O interface.]
Tightly coupled vs. loosely coupled
HW/SW Co-design
A profiler identifies the critical regions of a software application; those regions are mapped to hardware (ASIC/FPGA) while the remaining software runs on the processor.
Benefits
  Speedups of 2x to 10x typical
  Far more potential than dynamic SW optimizations (1.2x)
  Energy reductions of 25% to 95% typical
[Figure: time and energy bars comparing SW-only execution with HW/SW partitioned execution.]
Processor Specialization (ASIP)
Gains!
Exploiting Parallelism
Reconfigurable Instruction Set Processors
  Duplicated instruction-decode logic (2 symmetrical data channels)
  Duplicated commonly used functional units (ALU and shifter)
  All other functional units are shared (DSP operations, memory handler)
  A tightly coupled, pipelined, configurable gate array
XPRES Compiler
Reconfigurability
Reconfiguration is either static (execution is interrupted), semi-static (also called time-shared), or dynamic (in parallel with execution):
  Static configuration involves hardware changes at the slow rate of days/weeks.
  Semi-static (global RTR): if an application can be pipelined, it might be possible to implement each phase in sequence on the reconfigurable hardware (temporal partitioning techniques).
  Dynamic reconfiguration (local RTR): the hardware reconfigures itself on the fly as it executes a task, refining its own programming for improved performance.
Partial Reconfiguration Technology and Benefits
Partial Reconfiguration enables:
  System flexibility: perform more functions while maintaining communication links
  Size and cost reduction: time-multiplex the hardware to require a smaller FPGA
  Power reduction: shut down power-hungry tasks when not needed
  Fault-tolerant/self-repairing and adaptive systems
Power Reduction Techniques with PR
Many techniques can be employed to reduce power:
  Swap out high-power functions for low-power functions when maximum performance is not required
  Swap out high-power I/O standards for lower-power I/O when specific characteristics are not needed
  Swap out black boxes for inactive regions
  Time-multiplexing functions reduces power by reducing the amount of configured logic
[Figure: partial-reconfiguration system with a controller (MicroBlaze), the ICAP, and a flash controller.]
Current PR Design Flow
• Steps
  – Partition the system into modules
  – Define static modules and reconfigurable modules
  – Decide the number of PR regions (PRRs)
  – Decide PRR sizes, shapes, and locations
  – Map modules to PRRs
  – Define PRR interfaces; instantiate slice macros for the PRR interfaces
• Optimization problems
  – Design partitioning
  – Number of PRRs
  – PRR sizes, shapes, and locations
  – Mapping PRMs to PRRs
  – Type and placement of PRR interfaces
[Figure: design partitioning and floorplanning/budgeting: the system is split into static modules and reconfigurable modules (PRMs A, B, and C), which are mapped onto PRR 1 and PRR 2 alongside the FPGA's static region.]
Reconfigurable Operating Systems
Modern FPGAs can be partially and dynamically reconfigured. This feature adds tremendous flexibility to the Reconfigurable Computing (RC) field but also introduces challenges.
Reconfigurable operating systems aim to:
  Ease application development (higher level of abstraction)
  Ease application verification and maintenance
Design Space Exploration
What? Rapid exploration of various architectural solutions to be implemented on reconfigurable architectures, in order to select the most efficient architecture for one or several applications.
When? Takes place before architectural synthesis (algorithmic specification with a high-level abstraction language).
How? Estimations are based on a functional architecture model (generic, technology-independent), using an iterative exploration flow to progressively refine the architecture definition from a coarse model to a dedicated model.
Framework Overview
[Figure: the designer models the user application and constraints with a modeling tool; mathematical and meta-heuristic methods search the design space over a core library (embedded processors, dedicated hardware accelerators, and communication sub-systems such as point-to-point links and common buses); generated architectures are evaluated, and the proposed architecture is implemented on an ASIC, FPGA, or CGRA platform.]
Final Exam
Undergraduate final exam (50%); graduate final exam (40%)
  Comprehensive (includes most topics)
  Short questions that rely on understanding the topics
  2 hours long, 6-8 questions
  Only a few formulas are introduced in the course (memorize them)