4/4/2011
1
EE 811 Advanced Digital System Design
Dr. Arshad Aziz
Basic FPGA Architecture
Technology Timeline
(Figure: timeline from 1945 to 2000 showing when transistors, ICs (general), SRAMs & DRAMs, microprocessors, SPLDs, CPLDs, ASICs, and FPGAs were introduced.)
The Design Warrior’s Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043
Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)
Major FPGA vendors
SRAM-based FPGAs
• Xilinx Inc. – www.xilinx.com
• Altera Corp. – www.altera.com
• Atmel Corp. – www.atmel.com
• Lattice Semiconductor Corp. – www.latticesemi.com
Antifuse and flash-based FPGAs
• Actel Corp. – www.actel.com
• QuickLogic Corp. – www.quicklogic.com
Feature | SRAM | Antifuse | E2PROM / FLASH
Technology node | State-of-the-art | One or more generations behind | One or more generations behind
Reprogrammable | Yes (in system) | No | Yes (in-system or offline)
Reprogramming speed (inc. erasing) | Fast | ---- | 3x slower than SRAM
Volatile (must be programmed on power-up) | Yes | No | No (but can be if required)
Requires external configuration file | Yes | No | No
Good for prototyping | Yes (very good) | No | Yes (reasonable)
Instant-on | No | Yes | Yes
IP Security | Acceptable (especially when using bitstream encryption) | Very Good | Very Good
Size of configuration cell | Large (six transistors) | Very small | Medium-small (two transistors)
Power consumption | Medium | Low | Medium
Rad Hard | No | Yes | Not really
The Programmable Marketplace (Q1 Calendar Year 2005)
(Figure: two pie charts of market share.)
• PLD segment: Xilinx 51%, Altera 33%, QuickLogic 2%, Other 2%; Lattice and Actel hold the 5% and 7% slices
• FPGA sub-segment: Xilinx 58%, Altera 31%, All Others 11%
Source: Company reports. Latest information available; computed on a 4-quarter rolling basis
FPGA Families
Vendor | Low-cost | High-performance
Xilinx | Spartan 3, Spartan 3E, Spartan 3L | Virtex 4 LX / SX / FX, Virtex 5 LX
Altera | Cyclone II | Stratix II, Stratix II GX
Xilinx
• Primary products: FPGAs and the associated CAD software
– Programmable Logic Devices
– ISE Alliance and Foundation Series Design Software
• Main headquarters in San Jose, CA
• Fabless* Semiconductor and Software Company
– UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
– Seiko Epson (Japan)
– TSMC (Taiwan)
Source: [Xilinx Inc.]
Xilinx FPGA Families
• Old families
– XC3000, XC4000, XC5200
– Old 0.5µm, 0.35µm and 0.25µm technology. Not recommended for modern designs.
• Low-cost family
– Spartan/XL – derived from XC4000
– Spartan-II – derived from Virtex
– Spartan-IIE – derived from Virtex-E
– Spartan-3 (90 nm)
– Spartan-3E (90 nm)
– Spartan-3A (90 nm)
• High-performance families
– Virtex (220 nm)
– Virtex-E, Virtex-EM (180 nm)
– Virtex-II, Virtex-II PRO (130 nm)
– Virtex-4 (90 nm)
– Virtex-5 (65 nm)
Source: [Xilinx Inc.]
General structure of an FPGA
(Figure: Xilinx FPGA floorplan with configurable logic blocks, I/O blocks, and columns of block RAMs.)
Generic FPGA architecture
(Figure: an array of configurable logic blocks (CLBs) joined by connection blocks, switch blocks, and wire segments in routing channels, with I/O pads at the edge.)
Configurable logic block (CLB)
(Figure: Xilinx CLB. An array of CLBs; each CLB contains slices, and each slice contains logic cells.)
Xilinx Point of Reference
• A Xilinx CLB has FOUR slices
– Each slice has TWO logic cells
– Each logic cell has ONE 4-input LUT plus other logic (carry and control) plus a flip-flop/latch
• For SLICEL slices, these LUTs can be configured as:
1. LUT
• For SLICEM slices, these LUTs can be configured as:
1. LUT
2. 16 x 1 Distributed RAM (16 words x 1 bit/word)
3. 16-bit Shift Register
CLB Structure of Spartan 3
• Each Virtex-II CLB contains four slices
– Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs
– A switch matrix provides access to general routing resources
(Figure: a switch matrix connected to slices S0 to S3 through local routing, with CIN/COUT carry connections, SHIFT lines, and two BUFTs.)
Simplified view of a Xilinx Logic Cell
(Figure: a 4-input LUT, also usable as a 16x1 RAM or a 16-bit shift register, feeding a multiplexer and a flip-flop with clock, clock enable, and set/reset inputs. Signals shown: a, b, c, d, e, y, and q.)
Simplified Slice Structure
• Each slice has four outputs
– Two registered outputs, two non-registered outputs
– Two BUFTs associated with each CLB, accessible by all 16 CLB outputs
• Carry logic runs vertically, up only
– Two independent carry chains per CLB
(Figure: Slice 0 with two LUT, carry, and flip-flop paths; each flip-flop has D, CE, PRE, CLR, and Q pins.)
Detailed Slice Structure
• The next few slides discuss the slice features
– LUTs
– MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
– Carry Logic
– MULT_ANDs
– Sequential Elements
SRAM Cell (Pass Transistor)
• An SRAM cell can drive the gate (G) terminal of an NMOS transistor.
• If SRAM (M) = 1, the signal passes from source (S) to drain (D).
• An SRAM cell can be attached to the select line of a MUX to control it.
Look-Up Tables
• Combinatorial logic is stored in Look-Up Tables (LUTs)
– Also called Function Generators (FGs)
– Capacity is limited by the number of inputs, not by the complexity
• Delay through the LUT is constant
(Figure: combinatorial logic with inputs A, B, C, D and output Z, stored as the truth table below.)
A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
. . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1
Look-Up Table (LUT)
• A LUT can realize any Boolean function of its inputs.
• Assume the function to be realized is y = (a&b) | !c
• This is achieved by loading the LUT with the output value of the function for every input combination.
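As a rough illustration of "loading the LUT", the sketch below enumerates every input combination of y = (a&b) | !c and records the output bit that would be stored at that address. The helper names `lut_init` and `lut_read` are hypothetical, and the choice to map input i to address bit i is an assumption; real tools define their own bit ordering.

```python
def lut_init(func, n_inputs):
    """Return the 2**n_inputs bits a LUT would hold to implement func."""
    bits = []
    for addr in range(2 ** n_inputs):
        # Assumed ordering: input i is bit i of the address.
        inputs = [(addr >> i) & 1 for i in range(n_inputs)]
        bits.append(func(*inputs) & 1)
    return bits


def lut_read(bits, *inputs):
    """Evaluate the LUT: the inputs simply form the address of one stored bit."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return bits[addr]


def y(a, b, c):
    # The slide's example function, with !c written as (1 - c) for 0/1 values.
    return (a & b) | (1 - c)


Y_LUT = lut_init(y, 3)  # eight stored bits for a 3-input function
```

Reading the table back with `lut_read(Y_LUT, a, b, c)` gives the same result as evaluating the expression, which is why the delay through a LUT does not depend on how complex the function is.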
LUT (Look-Up Table) Functionality
• Look-Up tables are primary elements for logic implementation
• Each LUT can implement any function of 4 inputs
(Figure: a LUT with inputs x1, x2, x3, x4 and output y, shown loaded with two different 16-entry truth tables; each table configures the same LUT as a different logic function.)
5-Input Functions implemented using two LUTs
• One CLB Slice can implement any function of 5 inputs
• Logic function is partitioned between two LUTs
• F5 multiplexer selects LUT
(Figure: two LUT/ROM/RAM blocks share inputs F1 to F4 on address pins A1 to A4; the F5 multiplexer, controlled by BX, selects between their outputs and drives X.)
5-Input Functions implemented using two LUTs
(Figure: two LUTs share inputs X1 to X4; X5 controls a multiplexer that selects which LUT output reaches OUT.)
X4 X3 X2 X1 | Y when X5 = 0 | Y when X5 = 1
0 0 0 0 | 0 | 0
0 0 0 1 | 1 | 0
0 0 1 0 | 0 | 0
0 0 1 1 | 0 | 0
0 1 0 0 | 1 | 0
0 1 0 1 | 1 | 0
0 1 1 0 | 0 | 0
0 1 1 1 | 0 | 1
1 0 0 0 | 1 | 0
1 0 0 1 | 0 | 1
1 0 1 0 | 0 | 0
1 0 1 1 | 1 | 1
1 1 0 0 | 1 | 0
1 1 0 1 | 1 | 1
1 1 1 0 | 1 | 0
1 1 1 1 | 1 | 0
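The partitioning can be sketched in a few lines: the half of the slide's truth table with X5 = 0 goes into one 4-input LUT, the half with X5 = 1 into the other, and a 2:1 multiplexer driven by X5 picks the result. The names `LUT_LOW`, `LUT_HIGH`, `lut4`, and `f5` are illustrative only, and indexing the tables with X4 as the most significant bit is an assumption.

```python
# 16-entry contents of the two LUTs, taken from the slide's 32-row truth table.
# Index = X4 X3 X2 X1 read as a binary number (X4 is the MSB).
LUT_LOW = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]   # rows with X5 = 0
LUT_HIGH = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]  # rows with X5 = 1


def lut4(table, x4, x3, x2, x1):
    """Look up one bit in a 4-input LUT."""
    return table[(x4 << 3) | (x3 << 2) | (x2 << 1) | x1]


def f5(x5, x4, x3, x2, x1):
    """5-input function: both LUTs see X1..X4, the multiplexer selects on X5."""
    low = lut4(LUT_LOW, x4, x3, x2, x1)
    high = lut4(LUT_HIGH, x4, x3, x2, x1)
    return high if x5 else low
```

This is the Shannon expansion of the function around X5; the same idea extends to MUXF6, which would select between two such 5-input results.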
Dedicated Expansion Multiplexers
• MUXF5 combines 2 LUTs to create
– Any 5-input function (LUT5)
– Or selected functions up to 9 inputs
– Or 4x1 multiplexer
• MUXF6 combines 2 slices to form
– Any 6-input function (LUT6)
– Or selected functions up to 19 inputs
– 8x1 multiplexer
• Dedicated muxes are faster and more space efficient
(Figure: a CLB in which each slice has two LUTs feeding a MUXF5, and a MUXF6 combines two slices.)
Connecting Look-Up Tables
• MUXF5 combines LUTs in each slice
• MUXF6 combines slices S0 and S1
• MUXF6 combines slices S2 and S3
• MUXF7 combines the two MUXF6 outputs
• MUXF8 combines the two MUXF7 outputs (from the CLB above or below)
(Figure: one CLB with slices S0 to S3 and their F5, F6, F7, and F8 multiplexers.)
Programmable Logic Block
• Early devices were based on the concept of a programmable logic block, which comprised
– a 3-input lookup table (LUT),
– a register that could act as a flip-flop or a latch,
– a multiplexer, along with a few other elements.
3-, 4-, 5-, or 6-input LUTs?
• The key feature of an n-input LUT is that it can implement any possible n-input combinational logic function.
• Adding more inputs allows you to represent more complex functions, but every time you add an input, you double the number of SRAM cells!
• The first FPGAs were based on 3-input LUTs.
• FPGA vendors and researchers studied the relative merits of 3-, 4-, 5- and even 6-input LUTs.
• The current consensus is that 4-input LUTs offer the optimal balance of pros and cons.
• In the past, some devices were created using a mixture of different LUT sizes because this offered the promise of optimal device utilization.
• However, current logic synthesis tools prefer uniformity and regularity.
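The doubling rule can be made concrete with a small calculation. An n-input LUT needs one SRAM cell per input combination, so 2^n cells, and can therefore hold any of 2^(2^n) distinct functions. The function names below are illustrative, not part of any tool.

```python
def lut_sram_cells(n_inputs):
    """Configuration cells needed by an n-input LUT: one per input combination."""
    return 2 ** n_inputs


def distinct_functions(n_inputs):
    """Number of different Boolean functions an n-input LUT can be loaded with."""
    return 2 ** lut_sram_cells(n_inputs)


# Cell count for the LUT sizes discussed on the slide.
CELLS_BY_SIZE = {n: lut_sram_cells(n) for n in (3, 4, 5, 6)}
```

Going from a 4-input to a 6-input LUT quadruples the storage (16 to 64 cells), which is the cost side of the trade-off the slide describes.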
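FPGA Function generators
• LUT Example: Implement the function F = ABD + BCD + ABC using:
– 2-input LUTs
– 3-input LUTs
– 4-input LUTs
(Figures: with 2-input LUTs, the product terms ABD, BCD, and ABC are built from cascaded LUTs and then combined into F; with 3-input LUTs, one LUT per product term feeds a final LUT; with 4-input LUTs, a single LUT with inputs A, B, C, D produces F. Any complement bars in the formula did not survive extraction.)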
Fast Carry Logic
• Each CLB contains separate logic and routing for the fast generation of sum & carry signals
– Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
• Carry logic is independent of normal logic and routing resources
(Figure: carry logic routing running up a column of CLBs from LSB to MSB.)
Fast Carry Logic
• Simple, fast, and complete arithmetic logic
– Dedicated XOR gate for single-level sum completion
– Uses dedicated routing resources
– All synthesis tools can infer carry logic
(Figure: one CLB with two carry chains. The first chain runs from CIN through slices S0 and S1 to COUT, which goes to S0 of the next CLB; the second runs from CIN through slices S2 and S3 to COUT, which goes to CIN of S2 of the next CLB.)
Accessing Carry Logic
• All major synthesis tools can infer carry logic for arithmetic functions
– Addition (SUM <= A + B)
– Subtraction (DIFF <= A - B)
– Comparators (if A < B then…)
– Counters (count <= count +1)
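To show what the inferred carry logic computes, here is a bit-level sketch of an adder built the way the slides describe: a LUT produces the per-bit propagate signal, a carry multiplexer passes or regenerates the carry, and a dedicated XOR completes the sum. This is a behavioural model under those assumptions, not a description of any particular device primitive; `carry_chain_add` is a hypothetical name.

```python
def carry_chain_add(a, b, width, carry_in=0):
    """Add two unsigned width-bit numbers one bit at a time, LSB first.

    Returns (sum modulo 2**width, carry out of the top bit).
    """
    carry = carry_in
    total = 0
    for i in range(width):  # the chain runs upward from the LSB
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        propagate = a_i ^ b_i        # computed in the LUT
        sum_i = propagate ^ carry    # dedicated XOR gate completes the sum
        # Carry multiplexer: pass the incoming carry when the bits differ,
        # otherwise the carry out equals a_i (both 1 generates, both 0 kills).
        carry = carry if propagate else a_i
        total |= sum_i << i
    return total, carry


def carry_chain_sub(a, b, width):
    """Subtraction as A + (not B) + 1, reusing the same chain."""
    mask = (1 << width) - 1
    return carry_chain_add(a, ~b & mask, width, carry_in=1)
```

A counter is the same structure with one operand fixed at 1, and a comparator can use the carry out of the subtraction.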
Flexible Sequential Elements
• Either flip-flops or latches
• Two in each slice; eight in each CLB
• Inputs come from LUTs or from an independent CLB input
• Separate set and reset controls
– Can be synchronous or asynchronous
• All controls are shared within a slice
– Control signals can be inverted locally within a slice
(Figure: the FDRSE_1 flip-flop with D, CE, S, R, and Q pins; the FDCPE flip-flop with D, CE, PRE, CLR, and Q pins; and the LDCPE latch with D, CE, G, PRE, CLR, and Q pins.)
Shift Register
• Each LUT can be configured as a shift register
– Serial in, serial out
• Dynamically addressable delay up to 16 cycles
• For programmable pipeline
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth
(Figure: LUT = a chain of D flip-flops with shared CE and CLK, input IN, and an output OUT tap selected by DEPTH[3:0].)
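A behavioural sketch of the LUT in shift-register mode may help: sixteen storage bits shift by one position on each enabled clock, and a 4-bit depth value picks which bit is visible at the output. The class and method names are hypothetical; the convention that depth value d gives a delay of d + 1 clock cycles is stated here as an assumption about the addressing.

```python
class ShiftRegisterLUT:
    """16-bit serial-in, serial-out shift register with an addressable tap."""

    def __init__(self):
        self.bits = [0] * 16

    def clock(self, data_in, ce=1):
        """One rising clock edge; shifts only when clock enable is high."""
        if ce:
            self.bits = [data_in & 1] + self.bits[:-1]

    def out(self, depth):
        """Read the tap chosen by the 4-bit depth value (delay = depth + 1)."""
        return self.bits[depth & 0xF]


srl = ShiftRegisterLUT()
srl.clock(1)             # shift in a single 1 ...
FIRST_TAP = srl.out(0)   # ... visible at depth 0 after one clock
for _ in range(4):
    srl.clock(0)         # four more clocks push it to depth 4
```

Changing `depth` while the design runs changes the delay without reloading the LUT, which is what makes the delay dynamically addressable.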
Shift Register
• Register-rich FPGA
– Allows for addition of pipeline stages to increase throughput
• Data paths must be balanced to keep desired functionality
(Figure: a 64-bit path through Operation A (4 cycles) and Operation B (8 cycles) takes 12 cycles, while a parallel 64-bit path through Operation C takes 3 cycles, a 9-cycle imbalance.)
Shift Register LUT Example
(Figure: the same two paths with Operation D, a 9-cycle NOP delay, added after Operation C so that both paths take 12 cycles. Paths are statically balanced.)
Distributed RAM
• CLB LUT configurable as Distributed RAM
– An LUT equals 16x1 RAM
– Cascade LUTs to increase RAM size
• Synchronous write
• Asynchronous read
– Can create a synchronous read by using extra flip-flops
– Naturally, distributed RAM read is asynchronous
• Two LUTs can make
– 32 x 1 single-port RAM
– 16 x 2 single-port RAM
– 16 x 1 dual-port RAM
(Figure: one LUT = RAM16X1S with D, WE, WCLK, A0 to A3, and O; two LUTs = RAM32X1S (adds A4), RAM16X2S (D0, D1, O0, O1), or RAM16X1D (adds DPRA0 to DPRA3, with SPO and DPO outputs).)
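The read/write behaviour listed above can be sketched as a small model: writes take effect only on a clock edge with write enable high, while reads return the addressed bit immediately. The second class shows one plausible way two 16 x 1 LUTs could be cascaded into a 32 x 1 single-port RAM, using the fifth address bit to choose the LUT. All names are hypothetical, and this is a behavioural illustration rather than the actual primitive.

```python
class DistributedRam16x1:
    """One LUT used as a 16 x 1 RAM."""

    def __init__(self):
        self.cells = [0] * 16

    def read(self, addr):
        """Asynchronous read: no clock involved."""
        return self.cells[addr & 0xF]

    def clock(self, addr, data, we):
        """Synchronous write: happens on the clock edge when we is high."""
        if we:
            self.cells[addr & 0xF] = data & 1


class Ram32x1:
    """Two LUTs cascaded; address bit 4 selects which LUT is used."""

    def __init__(self):
        self.low = DistributedRam16x1()
        self.high = DistributedRam16x1()

    def read(self, addr):
        bank = self.high if addr & 0x10 else self.low
        return bank.read(addr & 0xF)

    def clock(self, addr, data, we):
        use_high = bool(addr & 0x10)
        self.low.clock(addr & 0xF, data, we and not use_high)
        self.high.clock(addr & 0xF, data, we and use_high)
```

Adding a flip-flop on the read output would turn the asynchronous read into a synchronous one, as the slide notes.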
Xilinx Multipurpose LUT
(Figure: a 4-input LUT that can alternatively be used as a 16x1 RAM or a 16-bit shift register.)
RAM Blocks and Multipliers in Xilinx FPGAs
Embedded RAM Blocks
• A lot of applications require the use of memory, so FPGAs now include relatively large chunks of embedded RAM called e-RAM or Block RAM (BRAM).
• Depending on the architecture of the component, these blocks might be positioned around the periphery of the device or organized as columns.
• These blocks can be used for a variety of purposes, such as implementing standard single- or dual-port RAMs, FIFOs, etc.
Spartan-3 Dual-Port Block RAM
• Most efficient memory implementation
– Dedicated blocks of memory
• Ideal for most memory requirements
– 4 to 104 memory blocks
– 18 kbits = 18,432 bits per block (16 k without parity bits)
– Use multiple blocks for larger memories
• Builds both single and true dual-port RAMs
• Synchronous write and read (different from distributed RAM)
(Figure: a block RAM with Port A and Port B.)
Spartan-3 Block RAM Amounts
Block RAM can have various configurations (port aspect ratios)
(Figure: one block viewed as 16k x 1 (addresses 0 to 16,383), 8k x 2 (0 to 8,191), 4k x 4 (0 to 4,095), 2k x (8+1) (0 to 2,047), or 1024 x (16+2) (0 to 1,023).)
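The aspect ratios follow from dividing the 16 k data bits of a block by the chosen data width, with the parity bits appearing once the width reaches a full byte. The sketch below reproduces the configurations shown in the figure; the function name and the "one parity bit per 8 data bits" rule are assumptions drawn from the (8+1) and (16+2) labels.

```python
DATA_BITS = 16 * 1024    # 16,384 data bits per block
TOTAL_BITS = 18 * 1024   # 18,432 bits including parity


def aspect_ratio(data_width):
    """Return (depth, data_width, parity_bits) for one port configuration."""
    depth = DATA_BITS // data_width
    parity_bits = data_width // 8   # assumed: one parity bit per full byte
    return depth, data_width, parity_bits


# The five configurations listed in the figure.
CONFIGS = [aspect_ratio(w) for w in (1, 2, 4, 8, 16)]

# Address lines needed for each depth (depth is always a power of two).
ADDRESS_BITS = [depth.bit_length() - 1 for depth, _, _ in CONFIGS]
```

Only the byte-wide and wider configurations reach the full 18,432 bits, because the narrow ones have no parity bits to address.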
Block RAM Port Aspect Ratios
Single-Port Block RAM
Dual-Port Block RAM
Dual-Port Bus Flexibility
(Figure: RAMB4_S16_S8. Port A is 18 bits wide and 1K deep, with pins ADDRA[9:0], DIA[17:0], DOA[17:0], WEA, ENA, RSTA, and CLKA. Port B is 9 bits wide and 2K deep, with pins ADDRB[10:0], DIB[8:0], DOB[8:0], WEB, ENB, RSTB, and CLKB.)
• Each port can be configured with a different data bus width
• Provides easy data width conversion without any additional logic
Two Independent Single-Port RAMs
(Figure: RAMB4_S1_S1 with both ports 1 bit wide and 8K deep. Port A is addressed with 0, ADDR[12:0] and Port B with 1, ADDR[12:0]; each port has its own WE, EN, RST, CLK, DI[0], and DO[0] pins.)
• Added advantage of True Dual-Port
– No wasted RAM bits
• Can split a Dual-Port 16K RAM into two Single-Port 8K RAMs
– Simultaneous independent access to each RAM
• To access the lower RAM
– Tie the MSB address bit to Logic Low
• To access the upper RAM
– Tie the MSB address bit to Logic High
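The address trick can be sketched as follows: both ports see the same 16K x 1 array, but each port has its top address bit tied to a constant, so the two ports can never touch the same cell. The classes and names are hypothetical and model only the addressing, not clocking or enable behaviour.

```python
class DualPortRam16Kx1:
    """Behavioural stand-in for one 16K x 1 block RAM."""

    def __init__(self):
        self.cells = [0] * (16 * 1024)


class SinglePortView:
    """One port of the block with its address MSB tied low (0) or high (1)."""

    def __init__(self, ram, tied_msb):
        self.ram = ram
        self.tied_msb = tied_msb & 1

    def _full_address(self, addr13):
        # 14-bit address = tied MSB followed by the user's 13-bit address.
        return (self.tied_msb << 13) | (addr13 & 0x1FFF)

    def write(self, addr13, bit):
        self.ram.cells[self._full_address(addr13)] = bit & 1

    def read(self, addr13):
        return self.ram.cells[self._full_address(addr13)]


block = DualPortRam16Kx1()
lower = SinglePortView(block, tied_msb=0)   # Port A: lower 8K
upper = SinglePortView(block, tied_msb=1)   # Port B: upper 8K
```

Because the two views cover disjoint halves of the array, every bit of the block is usable and the two RAMs can be accessed at the same time.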
Embedded Multipliers
• Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together.
• Current FPGAs incorporate special hard-wired multiplier blocks, which are typically located in close proximity to the embedded RAM blocks (arithmetic-based applications).
18 x 18 Embedded Multiplier
• Fast arithmetic functions
– Optimized to implement multiply / accumulate modules
• 18 x 18 signed multiplier
– Fully combinational
– Optional registers with CE & RST (pipeline)
– Independent from adjacent block RAM
18 x 18 Multiplier
• Embedded 18-bit x 18-bit multiplier
– 2’s complement signed operation
• Multipliers are organized in columns
(Figure: 18 x 18 multiplier with inputs Data_A (18 bits) and Data_B (18 bits) and Output (36 bits).)
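The signed behaviour of the block can be illustrated with plain integer arithmetic: each 18-bit input is interpreted as a 2's complement value, the two are multiplied, and the result is returned as a 36-bit pattern. The helper names are hypothetical; this models the arithmetic only, not the optional pipeline registers.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mult18x18(data_a, data_b):
    """Signed 18 x 18 multiply; returns the 36-bit output as a bit pattern."""
    product = to_signed(data_a, 18) * to_signed(data_b, 18)
    return product & ((1 << 36) - 1)


# 0x3FFFF is -1 as an 18-bit 2's complement value, so (-1) * 2 = -2.
MINUS_TWO = to_signed(mult18x18(0x3FFFF, 2), 36)
```

The largest possible magnitude, (-2^17) x (-2^17) = 2^34, still fits in the 36-bit output, so the multiplier never overflows.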
Positions of Multipliers
Asynchronous 18-bit Multiplier
18-bit Multiplier with Register
Clock tree
(Figure: a simple clock tree. The clock signal from the outside world enters through a special clock pin and pad and fans out to the flip-flops.)
Digital Clock Managers (DCM)
• The clock pin is usually connected to a special hard-wired function called a clock manager that generates “daughter clocks”.
• The daughter clocks may be used to drive internal clock trees or external output pins that can be used to provide clocking services to other devices on the host circuit board.
• There might be multiple clock managers, each supporting only a subset of features (jitter removal, frequency synthesis, …)
(Figure: the clock signal from the outside world enters through a special clock pin and pad into the clock manager, which produces daughter clocks used to drive internal clock trees or output pins.)
DCM: Jitter Removal
• In the real world, clock edges may arrive a little early or a little late.
• This varying delay results in a fuzzy clock (jitter).
• The FPGA clock manager can be used to detect and correct for this jitter and provide a “clean” daughter clock signal for use inside the device.
DCM: Frequency Synthesis
• The frequency of the clock signal being presented to the FPGA from the outside world might not be exactly what the design engineer wishes for.
• The clock manager can be used to generate daughter clocks with frequencies that are derived by multiplying or dividing the original signal.
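As a small worked example of frequency synthesis, the daughter clock frequency is the input frequency multiplied by one integer and divided by another. The function name and the example values (a 50 MHz input, multiply by 7, divide by 2) are illustrative assumptions, not settings taken from any datasheet.

```python
from fractions import Fraction


def synthesized_frequency(f_in_hz, multiply, divide):
    """Daughter clock frequency for an input clock scaled by multiply/divide."""
    return Fraction(f_in_hz) * multiply / divide


# Hypothetical example: a 50 MHz board clock scaled by 7/2.
F_OUT = synthesized_frequency(50_000_000, 7, 2)
```

Using `Fraction` keeps the result exact, which is convenient when the ratio does not give a whole number of hertz.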
DCM: Phase Shifting
• Certain designs require the use of clocks that are phase shifted (delayed) with respect to each other.
• Some clock managers allow you to select from fixed phase shifts of common values such as 120° and 240° (for a three-phase clocking scheme)
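A phase shift in degrees corresponds to a time delay of that fraction of one clock period. The sketch below converts the three-phase values mentioned above into delays for an assumed 100 MHz clock; the function name and the example frequency are illustrative only.

```python
def phase_delay_ns(frequency_hz, phase_degrees):
    """Delay in nanoseconds equivalent to a phase shift at a given frequency."""
    period_ns = 1e9 / frequency_hz
    return period_ns * (phase_degrees % 360) / 360


# Three-phase scheme on an assumed 100 MHz clock (10 ns period).
THREE_PHASE_NS = [phase_delay_ns(100e6, p) for p in (0, 120, 240)]
```

For the 10 ns period this gives delays of 0, one third, and two thirds of a period, so the three daughter clocks are evenly spaced.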
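Basic I/O Block Structure
(Figure: an I/O block with three registered paths, each built around a flip-flop with D, EC, Q, and SR pins: the three-state control path (three-state input, three-state FF enable), the output path (output, FF enable), and the input path (direct input, registered input, FF enable). Clock and set/reset are shared.)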
IOB Functionality
• IOB provides interface between the package pins and CLBs
• Each IOB can work as uni- or bi-directional I/O
• Outputs can be forced into High Impedance
• Inputs and outputs can be registered
– advised for high-performance I/O
• Inputs can be delayed
Configurable I/O Impedances
• The signals used to connect devices on today’s circuit boards often have fast edge rates.
• In order to prevent signals reflecting back, it is necessary to apply appropriate terminating resistors to the FPGA input and output pins.
• In the past, resistors were applied as discrete components (outside the FPGA).
• Today's FPGAs allow the use of internal terminating resistors whose value can be configured by the user.
Spartan 3 Family Attributes
FPGA Nomenclature
Spartan-3 FPGA Family Members
2001 – Virtex-II FPGA Family
• Virtex-II FPGA introduced, followed by Virtex-II Pro in 2003
– 444 18x18 Multipliers & 18 kbit block RAMs introduced
– Gbit Serial I/O Communications & PowerPC Processors introduced
– Complex Floating Point Algorithm Implementation now possible
• Virtex-II / Pro
– 44,000 Logic Slices
– 444 18 Kbit BRAMs
– 444 18x18 Multipliers
– 2 PowerPC Processors
– 20 Gbit I/O
– 1164 Max User I/O
Virtex II Pro Floorplan
• 1 to 4 PowerPCs
• 4 to 16 multi-gigabit transceivers (up to 16 serial transceivers, 622 Mbps to 3.125 Gbps)
• 12 to 216 multipliers
• 3,000 to 50,000 logic cells
• 200k to 4M bits RAM
• 204 to 852 I/Os
(Figure: floorplan showing the PowerPCs, the serial transceivers, and the logic cells.)
Virtex-II Pro (Selection)
Embedded Processor Cores (Hard and Soft)
• The majority of designs make use of microprocessors.
• These used to appear as discrete devices on the circuit board.
• Lately, high-end FPGAs have become available that contain one or more embedded microprocessors (referred to as microprocessor cores).
• There are two types of cores:
– A hard microprocessor core is implemented as a dedicated predefined block (two approaches)
– A soft microprocessor core is implemented by configuring a group of programmable logic blocks to act as a microprocessor.
Embedded Core (Inside)
• Xilinx and Altera tend to embed one or more microprocessor cores directly into the main FPGA fabric (PowerPC).
• In this case the design tools have to be able to take account of the presence of these blocks in the fabric (any memory used by the core is formed from the embedded RAM blocks).
• The main advantage of this scheme is the inherent speed advantage gained from having the processor core in intimate proximity to the FPGA fabric.
Soft Core
• As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor.
• Soft cores are simpler (more primitive) and slower than their hard-core counterparts.
ADVANTAGE?
1. The main advantage of this scheme is that the user need only implement a core if he/she needs it.
2. Also, the user can instantiate as many cores as they require until they run out of resources!
Virtex Architectures
Built for high-performance applications
Other families include
• Virtex-II Pro
• Virtex-4
• Virtex-5
Latest family includes
• Virtex-6
Virtex-II Pro Architecture
Contains embedded Processors and Multi-Gigabit Transceivers
• Advanced FPGA Logic – 99k logic cells
• High performance True Dual-port RAM – 8 Mb
• SelectIO™-Ultra Technology – 1164 I/O
• XtremeDSP Functionality – Embedded multipliers
• RocketIO™ and RocketIO X High-speed Serial Transceivers – 622 Mbps to 3.125 Gbps
• PowerPC™ Processors, 400+ MHz Clock Rate – 2
• XCITE Digitally Controlled Impedance – Any I/O
• DCM™ Digital Clock Management – 12
• 130 nm, 9 layer copper in 300 mm wafer technology
Virtex-4 Family
Advanced Silicon Modular BLock (ASMBL) Architecture
Optimized for logic, Embedded, and Signal Processing
Resource | LX | FX | SX
Logic | 14K–200K LCs | 12K–140K LCs | 23K–55K LCs
Memory | 0.9–6 Mb | 0.6–10 Mb | 2.3–5.7 Mb
DCMs | 4–12 | 4–20 | 4–8
DSP Slices | 32–96 | 32–192 | 128–512
SelectIO | 240–960 | 240–896 | 320–640
RocketIO | N/A | 0–24 Channels | N/A
PowerPC | N/A | 1 or 2 Cores | N/A
Ethernet MAC | N/A | 2 or 4 Cores | N/A
Virtex-4 Architecture
• Advanced CLBs: 200K Logic Cells
• Smart RAM: New block RAM/FIFO
• Xesium Clocking Technology: 500 MHz
• XtremeDSP™ Technology Slices: 256 18x18 GMACs
• 1 Gbps SelectIO™: ChipSync™ Source synch, XCITE Active Termination
• RocketIO™ Multi-Gigabit Transceivers: 622 Mbps–10.3 Gbps
• PowerPC™ 405 with APU Interface: 450 MHz, 680 DMIPS
• Tri-Mode Ethernet MAC: 10/100/1000 Mbps
Virtex-5 Family
Virtex™-5 Platforms
Optimized for logic, Embedded, Signal Processing, and High-Speed Connectivity
Platform | Focus
LX | Logic
LXT | Logic/Serial
SXT | DSP/Serial
FXT | Emb./Serial
(Figure: per-platform comparison of logic, on-chip RAM, DSP capabilities, serial I/Os, parallel I/Os, and PowerPC® processors.)
Virtex-5 Architecture
• Most Advanced High-Performance Real 6LUT Logic Fabric
• 36Kbit Dual-Port Block RAM / FIFO with Integrated ECC
• 550 MHz Clock Management Tile with DCM and PLL
• SelectIO with ChipSync Technology and XCITE DCI
• Advanced Configuration Options
• 25x18 DSP Slice with Integrated ALU
• Tri-Mode 10/100/1000 Mbps Ethernet MACs
• PCI Express® Endpoint Block
• System Monitor Function with Built-in ADC
• RocketIO™ Transceiver Options
– Low-Power GTP: Up to 3.75 Gbps
– High-Performance GTX: Up to 6.5 Gbps
• Next Generation PowerPC® Embedded Processor
(In the figure, features are marked as New or Enhanced.)
The Spartan-3 Family
Built for high volume, low-cost applications
• 18x18 bit Embedded Pipelined Multipliers for efficient DSP
• Configurable 18K Block RAMs + Distributed RAM
• 4 I/O Banks, Support for all I/O Standards including PCI, DDR333, RSDS, mini-LVDS
• Up to eight on-chip Digital Clock Managers to support multiple system clocks
(Figure: Spartan-3 die with I/O banks 0 to 3.)
Spartan-3 Family
Based upon Virtex-II Architecture – Optimized for Lower Cost
• Smaller process = lower core voltage
– .09 micron versus .15 micron
– Vccint = 1.2V versus 1.5V
• Logic resources
– Only one-half of the slices support RAM or SRL16s (SLICEM)
– Fewer block RAMs and multiplier blocks
• Clock Resources
– Fewer global clock multiplexers and DCM blocks
• I/O Resources
– Fewer pins per package
– No internal 3-state buffers
– Support for different standards
  • New standards: 1.2V LVCMOS, 1.8V HSTL, and SSTL
  • Default is LVCMOS, versus LVTTL
SLICEM and SLICEL
• Each Spartan™-3 CLB contains four slices
– Similar to the Virtex™-II
• Slices are grouped in pairs
– Left-hand SLICEM (Memory)
  • LUTs can be configured as memory or SRL16
– Right-hand SLICEL (Logic)
  • LUT can be used as logic only
(Figure: a switch matrix and a CLB with left-hand SLICEM slices X0Y0 and X0Y1 and right-hand SLICEL slices X1Y0 and X1Y1, linked by fast connects, with CIN/COUT carry lines and SHIFTIN/SHIFTOUT lines.)
Multiple Domain-optimized Platforms
Spartan-3E Features
• More gates per I/O than Spartan-3
• Removed some I/O standards
– Higher-drive LVCMOS
– GTL, GTLP
– SSTL2_II
– HSTL_II_18, HSTL_I, HSTL_III
– LVDS_EXT, ULVDS
• DDR Cascade
– Internal data is presented on a single clock edge
• 16 BUFGMUXes on left and right sides
– Drive half the chip only
– In addition to eight global clocks
• Pipelined multipliers
• Additional configuration modes
– SPI, BPI
– Multi-Boot mode
Spartan-3A DSP Features
• Increased amount of block memory (BRAM)
– 1512K of S3A1800 vs 648K of S3E1600
• More XtremeDSP DSP48A slices
– Replaces Embedded multiplier of Spartan-3E
– 3400A – 126 DSP48As
– 1800A – 84 DSP48As
Spartan-3A DSP: Tuning DSP Performance
• Integrated XtremeDSP Slice
– Application optimized capacity
– Integrated pre-adder optimized for filters
– 250 MHz operation, standard speed grade
– Compatible with Virtex-DSP
• Increased memory capacity and performance
– Also important for embedded processing, complex IP, etc.
(Figure: XtremeDSP DSP48A Slice.)
DSP48 Comparison
Function | DSP48 | DSP48E | DSP48A | Benefit
Multiplier | 18 x 18 | 25 x 18 | 18 x 18 | Reduces FPGA resource needs for DSP algorithms.
Pre-Adder | No | No | Yes | Reduces the critical path timing in FIR filter applications for better performance. Important in FIR filter construction.
Cascade Inputs | One | Two | One | Enables fast data path chaining of DSP48 blocks for larger filters.
Cascade Output | Yes | Yes | Yes | Enables fast data path chaining of DSP48 blocks for larger filters.
Dedicated C input | No | Yes | Yes | The C input supports many 3-input mathematical functions, such as 3-input addition and 2-input multiplication with a single addition, and the very valuable rounding of multiplication away from zero.
Adder | 3 input 48 bit | 3 input 48 bit | 2 input 48 bit | Supports simple add and accumulate functions.
Dynamic Opmodes | Yes | Yes | Yes | One DSP48 can provide more than one function: multiply, multiply-add, multiply-accumulate, etc.
ALU Logic Functions | No | Yes | No | Similar to the ALU of a microprocessor. Enables the selection of the ALU function on a clock-cycle basis. Enables multiple functions to be selected (Add, Subtract, or Compare).
Pattern Detect | No | Yes | No | This feature supports convergent rounding, underflow/overflow detection for saturation arithmetic, and auto-resetting counters/accumulators.
SIMD ALU Support | No | Yes | No | Enables parallel ALU operations on multiple data sets.
Carry Signals | Carry In | Carry In & Out | Carry In & Out | Supports fast carry functions between DSP blocks. Often a speed limiting path.
Spartan-3A Device Table
Family | Spartan-3A | Spartan-3A DSP | Spartan-3A DSP
Device | XC3S1400A | XC3SD1800A | XC3SD3400A
XtremeDSP DSP48A Slices | - | 84 | 126
Dedicated Multipliers | 32 | DSP48As | DSP48As
Block RAM Blocks | 32 | 84 | 126
Block RAM (Kb) | 576 | 1,512 | 2,268
Distributed RAM (Kb) | 176 | 260 | 373
FFs/LUTs | 22,528 | 33,280 | 47,744
Logic Cells | 25,344 | 37,440 | 53,712
DCMs | 8 | 8 | 8
Max Diff I/O Pairs | 227 | 227 | 213
CS484 19x19mm (0.8mm pitch) | - | 309 | 309
*FG676 27x27mm (1.0mm pitch) | 502 | 519 | 469
Latest Families
Basic Architecture 89
Architecture AlignmentVirtex-6 FPGAs Spartan-6 FPGAs
150K Logic Cell
Device
760K Logic Cell
Device
Common Resources
LUT-6 CLB
DSP Slices
BlockRAM
High-performance Clocking
*Optimized for target application in each family
3.3 Volt compatible I/O
Hardened Memory Controllers
HSS Transceivers*
Parallel I/O FIFO Logic
System Monitor
Tri-mode EMAC
PCIe® Interface
Enables IP Portability, Protects Design Investments
Addressing the Broad Range of Technical Requirements
[Figure: device families plotted by market size across application market segments (+ 100s more)]
• Spartan-6 LX: Lowest cost logic + DSP
• Spartan-6 LXT: Lowest logic + high-speed serial
• Virtex-6 LXT: High logic density + serial connectivity
• Virtex-6 SXT: DSP + logic + serial connectivity
• Virtex-6 HXT: Ultra high-speed serial connectivity + logic
Designers Eccentrics
• Higher System Performance
– More design margin to simplify designs
– Higher integrated functionality
• Lower System Cost
– Reduce BOM
– Implement design in a smaller device & lower speed-grade
• Lower Power
– Help meet power budgets
– Eliminate heat sinks & fans
– Prevent thermal runaway
Virtex-6 Family
Virtex® Product & Process Evolution
Delivering Balanced Performance, Power, and Cost
[Figure: Virtex families by process node, 1st Generation through 6th Generation]
• Virtex: 220-nm
• Virtex-E: 180-nm
• Virtex-II: 150-nm
• Virtex-II Pro: 130-nm
• Virtex-4: 90-nm
• Virtex-5: 65-nm
• Virtex-6: 40-nm
Strong Focus on Power Reduction
• Static Power Reduction
– Higher distribution of low leakage transistors
• Dynamic Power Reduction
– Reduced capacitance through device shrink
• Reduced Core Voltage Devices Lower Overall Power
– VCCINT = 0.9V option allows power / performance tradeoff
• I/O Power Improvements
– Dynamic termination
• System Monitor
– Allows sophisticated monitoring of temperature and voltage
Up to 50% Power Reduction vs. Previous Generation
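The core-voltage tradeoff can be illustrated with the standard CMOS relation for dynamic power, P = a·C·V²·f. The sketch below is only an estimate under stated assumptions: a reference VCCINT of 1.0 V is assumed for comparison (the slide names only the 0.9 V option), and capacitance, switching activity, and clock frequency are held fixed. The function name is illustrative.

```python
def dynamic_power_ratio(v_new: float, v_ref: float) -> float:
    """Ratio of dynamic power at v_new to dynamic power at v_ref.

    Assumes P = a * C * V^2 * f with a, C, and f unchanged,
    so only the squared voltage term matters.
    """
    return (v_new / v_ref) ** 2


# 0.9 V is the option named on the slide; 1.0 V is an assumed reference.
ratio = dynamic_power_ratio(0.9, 1.0)
print(f"Dynamic power at 0.9 V is about {ratio:.0%} of the 1.0 V value")
```

In practice the lower voltage also limits the achievable clock rate, which is the power / performance tradeoff the slide refers to.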
Virtex-6 Logic Fabric
• Virtex-6 Configurable Logic Block (CLB)
– Each CLB contains two slices
– Each slice contains four 6-input Lookup Tables (6LUT)
  • Slices that implement logic functions (slice_l)
  • Slices for memories and shift registers (slice_m)
• LUT6 implements
– All functions of up to 6 variables
– Two functions of up to 5 variables each
– Shift registers up to 32 stages long
– Memories of 64 bits
• Multiple configurations within a slice
[Figure: a CLB containing two slices, each slice containing four LUTs]
• Power Consumption Benefits: Shift register mode greatly reduces power consumption over FF implementation
• Performance Benefits: Increased ratio of slice_m; memories available closer to the source or target logic
• Cost Benefits: Can pack logic and memory functions more efficiently
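Why a 6-input LUT can implement any function of up to 6 variables, and why the same cell can serve as a 64-bit memory, is easiest to see in a behavioral model: the six inputs form an address into a 64-entry truth table. The sketch below is a plain software model, not a Xilinx primitive; the function names are hypothetical.

```python
def make_lut6(func):
    """Build the 64-bit truth table for a 6-input Boolean function."""
    init = 0
    for addr in range(64):
        bits = [(addr >> i) & 1 for i in range(6)]  # bits[0] is input 0
        if func(*bits):
            init |= 1 << addr
    return init


def lut6_read(init, inputs):
    """Evaluate the LUT: the six input bits address one table entry."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (init >> addr) & 1


# Two different functions differ only in their 64 table bits.
and6 = make_lut6(lambda a, b, c, d, e, f: a & b & c & d & e & f)
xor6 = make_lut6(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)

print(hex(and6))                              # only the last entry is set
print(lut6_read(xor6, [1, 0, 0, 0, 0, 0]))    # odd number of ones
```

Changing the stored 64 bits changes the function without changing any wiring, which is what configuration of the logic fabric amounts to.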
Higher DSP Performance
• Most advanced DSP architecture
– New optional pre-adder for symmetric filters
– 25x18 multiplier
  • High resolution filters
  • Efficient floating point support
– ALU-like second stage enables mapping of advanced operations
  • Programmable op-code
  • SIMD support
  • Addition / Subtraction / Logic functions
– Pattern detector
• Lowest power consumption
• Highest DSP slice capacity
– Up to 2K DSP Slices
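The benefit of a pre-adder for symmetric filters can be sketched in software. When the coefficients satisfy h[k] = h[N-1-k], the two samples that share a coefficient are added first and multiplied once, roughly halving the multiplier count. This is a minimal arithmetic illustration, not a model of the DSP slice itself; the coefficient and sample values are made up.

```python
def fir_direct(h, x_window):
    """Direct form: one multiply per tap."""
    return sum(hk * xk for hk, xk in zip(h, x_window))


def fir_preadd(h, x_window):
    """Symmetric form: add mirrored samples, then one multiply per tap pair."""
    n = len(h)
    acc = 0
    for k in range(n // 2):
        # Pre-adder output feeds a single multiplier for both taps.
        acc += h[k] * (x_window[k] + x_window[n - 1 - k])
    if n % 2:
        # Middle tap of an odd-length filter has no mirrored partner.
        acc += h[n // 2] * x_window[n // 2]
    return acc


h = [1, 3, 5, 3, 1]      # symmetric coefficients (illustrative)
x = [2, -1, 4, 0, 7]     # current window of input samples (illustrative)

print(fir_direct(h, x), fir_preadd(h, x))
```

For this 5-tap example the direct form uses five multiplies and the pre-added form uses three.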
Virtex®-6 LXT / SXT FPGAs
Spartan-6 Family
Spartan-6
• Next Generation 45nm Spartan Family
– Increased performance & density
– Evolutionary feature enhancements
– Dramatic cost & power reductions
• Two Silicon Platforms
– LX: Cost optimized Logic, Memory
– LXT: LX features plus High-Speed Serial Connectivity
– More unified & integrated with Virtex
Delivering the Optimal Balance of Cost, Power & Performance
Spartan-6 Logic Evolution
Higher Performance, Increased Utilization
• Modified Virtex 6-input LUT
– 4 additional flip-flops per slice
– Higher utilization for register intensive designs
• Efficient & Capable
– Logic
– Arithmetic functions
– Distributed RAM & shift registers
– Interconnect
• Up to 25% Higher Performance
[Figure: Spartan-3A Series & Earlier, 4LUT with LUT / FF Pair (Great General-Purpose Logic), compared with Spartan-6, 6LUT with LUT / Dual FF Pair (NEW Efficient Design: 6-input LUT & 2nd Flip-flop for Higher Utilization)]
Spartan-6 CLB Logic Slices
• SliceM (25%): LUT6, 8 Registers, Carry Logic, Wide Function Muxes, Distributed RAM / SRL logic
• SliceL (25%): LUT6, 8 Registers, Carry Logic, Wide Function Muxes
• SliceX (50%): LUT6, 8 Registers, Optimized for Logic
Slice mix chosen for the optimal balance of Cost, Power & Performance
Spartan-6 Lowest Total Power
• Static power reductions
– Process & architectural innovations
• Dynamic power reduction
– Lower node capacitance & architectural innovations
• More hard IP functionality
– Integrated transceivers & other logic reduces power
– Hard IP uses less current & power than soft IP
• Lower IO power
• Low power option -1L reduces power even further
• Fewer supply rails reduces power
Spartan-6 Hard Memory Controller
• New Hard Block Memory Controller
– Up to 4 controllers per device
• Why a Hard Memory Block?
– Very common design component
– Multiple customer benefits

Customer Requests | Spartan-6 Hard Block Memory Controller Benefits
Higher performance | Up to 800 Mbps
Lower cost | Saves soft logic, smaller die
Lower power | Dedicated logic
Easier designs | Timing closure no longer an issue; Configurable MultiPort user interface; CoreGen/MIG wizard & EDK support
Memory Controller
• Only low cost FPGA with a “hard” memory controller
• Guaranteed memory interface performance providing
– Reduced engineering & board design time
– DDR, DDR2, DDR3 & LP DDR support
– Up to 12.8 Gbps bandwidth for each memory controller
• Automatic calibration features
• Multiport structure for user interface
– Six 32-bit programmable ports from fabric
– Controller interface to 4, 8 or 16 bit memory devices
[Figure: Spartan-6 connected to DRAM (DDR, DDR2, DDR3, LP DDR) and to SRAM, FLASH, EEPROM]
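The peak bandwidth of one memory controller follows from the per-pin data rate and the interface width. The sketch below uses the 800 Mbps rate and 16-bit width quoted in these slides and assumes peak bandwidth is simply rate times width, ignoring refresh, turnaround, and protocol overhead. The function name is illustrative.

```python
def peak_bandwidth_gbps(rate_mbps_per_pin: float, bus_width_bits: int) -> float:
    """Peak bandwidth in Gbps: per-pin data rate times number of data pins."""
    return rate_mbps_per_pin * bus_width_bits / 1000.0


# 800 Mbps per pin on a 16-bit interface.
print(peak_bandwidth_gbps(800, 16))
# Narrower devices scale down proportionally.
print(peak_bandwidth_gbps(800, 8))
```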
Integrated DSP Slice
XtremeDSP DSP48A1 Slice
• 250 MHz implementation
– Fast multiplier & 48 bit adder
– ASIC-like performance
• Input and output registers for higher speed
Optimizes FIR filter applications
Better, More BRAM
• More Block RAMs
– 2x higher BRAM to Logic Cell ratio than Spartan-3A platform
• More port flexibility
– 18K can be split into two 9K BRAM blocks and can be independently addressed
• Improves buffering, caching & data storage
– Excellent for embedded processing, communication protocols
– Enables DSP blocks to provide more efficient video and surveillance algorithms
• Lower Static Power
[Figure: one 18K BRAM OR two 9K BRAMs]
Compare to Spartan-3A
Twice the Capabilities, Half the Power, Hard Blocks!
Feature | Extended Spartan-3A (90nm) | Spartan-6 (45nm)
Logic Cells | Up to 55K | Up to 150K
LUT Design | 4-input LUT + FF | 6-input LUT + 2FF
Block RAM (Mbit) | Up to 2 Mbit | Up to 5 Mbit
Transceiver Count / Speed | no | Up to 8 / Up to 3.125 Gbps
Voltage Scaling | No (1.2V only) | Yes (1.2V, 1.0V)
Static Power (typ mW) | 11 mW (smallest density) | Up to 60% less!
Memory Interface | 400 Mbps | DDR3 800 Mbps
Max Differential IO | 640 Mbps | 1050 Mbps
Multipliers/DSP | Up to 126 Multipliers / DSP | Up to 184 DSP48 Blocks
Memory Controllers | no | Up to 4 Hard Blocks
Clock Management | DCM Only | DCM & PLL
PCI Express Endpoint | no | Yes, Gen 1
Security | Device DNA Only | Device DNA & AES
Spartan-6 LX / LXT FPGAs
** All memory controllers support the x16 interface, except in the CS225 package, where only x8 is supported
FPGA Design Flow
Design process (1)
Specification
Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in experiment 5, this time your unit has to be able to perform the encryption algorithm by itself, executing 32 rounds…..

Verilog description (Your Verilog Source Files)
library IEEE;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity RC5_core is
  port(
    clock, reset, encr_decr: in std_logic;
    data_input: in std_logic_vector(31 downto 0);
    data_output: out std_logic_vector(31 downto 0);
    out_full: in std_logic;
    key_input: in std_logic_vector(31 downto 0);
    key_read: out std_logic  -- last port: no trailing semicolon
  );
end RC5_core;  -- name must match the entity declaration

Functional simulation
Synthesis
Post-synthesis simulation

Design process (2)
Implementation (Mapping, Placing & Routing)
Timing simulation
Configuration
On chip testing
Design Process control from Active-HDL

Logic Synthesis
VHDL description -> Circuit netlist

architecture MLU_DATAFLOW of MLU is
  signal A1: STD_LOGIC;
  signal B1: STD_LOGIC;
  signal Y1: STD_LOGIC;
  signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
begin
  -- optional inversion of the inputs and of the output
  A1 <= A when (NEG_A='0') else not A;
  B1 <= B when (NEG_B='0') else not B;
  Y <= Y1 when (NEG_Y='0') else not Y1;

  -- the four candidate logic functions
  MUX_0 <= A1 and B1;
  MUX_1 <= A1 or B1;
  MUX_2 <= A1 xor B1;
  MUX_3 <= A1 xnor B1;

  -- L1 and L0 select which function drives Y1
  with (L1 & L0) select
    Y1 <= MUX_0 when "00",
          MUX_1 when "01",
          MUX_2 when "10",
          MUX_3 when others;
end MLU_DATAFLOW;
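As a quick cross-check of what the MLU_DATAFLOW architecture computes, the same dataflow can be mirrored in a few lines of Python. This is only a bit-level reference model written for illustration (the function name `mlu` is not from the slides); it is not a replacement for functional simulation of the VHDL.

```python
def mlu(a, b, neg_a, neg_b, neg_y, l1, l0):
    """Bit-level model of MLU_DATAFLOW; every argument is 0 or 1."""
    a1 = a ^ neg_a                      # A1 <= A when NEG_A='0' else not A
    b1 = b ^ neg_b                      # B1 <= B when NEG_B='0' else not B
    mux = [a1 & b1,                     # MUX_0: and
           a1 | b1,                     # MUX_1: or
           a1 ^ b1,                     # MUX_2: xor
           1 - (a1 ^ b1)]               # MUX_3: xnor
    y1 = mux[(l1 << 1) | l0]            # with (L1 & L0) select
    return y1 ^ neg_y                   # Y <= Y1 when NEG_Y='0' else not Y1


# Plain AND, then NAND by inverting the output.
print(mlu(1, 1, 0, 0, 0, 0, 0), mlu(1, 1, 0, 0, 1, 0, 0))
```

Inverting both inputs and the output around the OR function gives AND again (De Morgan), which is a convenient sanity test for the model.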
Synthesis Tools
• XST
• … and others

Features of synthesis tools
• Interpret RTL code
• Synplify Pro: Produces synthesized circuit netlist in a standard EDIF (.edf) format
– Can optionally produce .VHM (VHDL code merged into one) file for post-synthesis simulation
• XST: Produces synthesized circuit netlist in NGC format
• Netlist is composed of gates in the particular Xilinx implementation library
– http://toolbox.xilinx.com/docsan/xilinx9/books/manuals.pdf has information on libraries
• Give preliminary performance estimates
• Some can display circuit schematics corresponding to EDIF netlist
Timing report after synthesis
Performance Summary
*******************
Worst slack in design: -0.924

                 Requested   Estimated   Requested   Estimated             Clock      Clock
Starting Clock   Frequency   Frequency   Period      Period      Slack     Type       Group
-------------------------------------------------------------------------------------------------------
exam1|clk        85.0 MHz    78.8 MHz    11.765      12.688      -0.924    inferred   Inferred_clkgroup_0
System           85.0 MHz    86.4 MHz    11.765      11.572      0.193     system     default_clkgroup
=======================================================================================================
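The slack column in the report is the requested clock period minus the estimated period; a negative value means the estimate misses the requested frequency. A minimal sketch of that arithmetic, using the period values printed in the report (function names are illustrative). Because the printed periods are rounded, a recomputed slack can differ from the reported one in the last digit.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(requested_period_ns: float, estimated_period_ns: float) -> float:
    """Positive slack: constraint met. Negative slack: constraint missed."""
    return requested_period_ns - estimated_period_ns


clk_slack = slack_ns(11.765, 12.688)      # exam1|clk row
system_slack = slack_ns(11.765, 11.572)   # System row

print(f"requested period at 85 MHz: {period_ns(85.0):.3f} ns")
print(f"exam1|clk slack: {clk_slack:.3f} ns, met: {clk_slack >= 0}")
print(f"System slack:    {system_slack:.3f} ns, met: {system_slack >= 0}")
```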
Implementation
• After synthesis the entire implementation process is performed by FPGA vendor tools
Mapping
[Figure: netlist logic grouped into lookup tables (LUT0 to LUT5) and flip-flops (FF1, FF2)]
Placing
[Figure: mapped logic assigned to CLB slices on the FPGA]

Routing
[Figure: programmable connections made between the placed blocks on the FPGA]
Map report header
Release 7.1.03i Map H.41
Xilinx Mapping Report File for Design 'exam1'

Design Information
------------------
Command Line   : c:\Xilinx\bin\nt\map.exe -p 2S200FG256-6 -o map.ncd -pr b -k 4 -cm area -c 100 -tx off exam1.ngd exam1.pcf
Target Device  : xc2s200
Target Package : fg256
Target Speed   : -6
Mapper Version : spartan2 -- $Revision: 1.26.6.4 $
Mapped Date    : Wed Nov 02 11:15:15 2005

Map report
Design Summary
--------------
Number of errors:      0
Number of warnings:    0
Logic Utilization:
  Number of Slice Flip Flops:        144 out of 4,704    3%
  Number of 4 input LUTs:            173 out of 4,704    3%
Logic Distribution:
  Number of occupied Slices:                          145 out of 2,352    6%
  Number of Slices containing only related logic:     145 out of 145    100%
  Number of Slices containing unrelated logic:          0 out of 145      0%
  *See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs:           210 out of 4,704    4%
  Number used as logic:              173
  Number used as a route-thru:         5
  Number used as 16x1 RAMs:           32
Number of bonded IOBs:                74 out of 176     42%
Number of GCLKs:                       1 out of 4       25%
Number of GCLKIOBs:                    1 out of 4       25%
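The percentages in the map report are simply used resources divided by available resources. The sketch below recomputes them from the counts in the report. Truncating to a whole percent reproduces every figure shown (for example, 173 of 4,704 LUTs is about 3.7% and is listed as 3%), but that the tool truncates rather than rounds is an inference from these numbers, not a documented behavior. Names are illustrative.

```python
def utilization_percent(used: int, available: int) -> float:
    """Resource utilization as a percentage of the device total."""
    return 100.0 * used / available


# Counts taken from the map report for design 'exam1'.
report = {
    "Slice Flip Flops": (144, 4704),
    "4 input LUTs (logic)": (173, 4704),
    "Occupied Slices": (145, 2352),
    "Total 4 input LUTs": (210, 4704),
    "Bonded IOBs": (74, 176),
    "GCLKs": (1, 4),
}

for name, (used, available) in report.items():
    pct = utilization_percent(used, available)
    print(f"{name:22s} {used:5d} / {available:5d} = {pct:5.2f}% (shown as {int(pct)}%)")
```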
Place & route report
Timing Score: 0

Asterisk (*) preceding a constraint indicates it was not met.
This may be due to a setup or hold violation.

--------------------------------------------------------------------------------
Constraint                                        | Requested | Actual   | Logic Levels
--------------------------------------------------------------------------------
TS_clk = PERIOD TIMEGRP "clk" 11.765 ns HIGH 50%  | 11.765ns  | 11.622ns | 13
--------------------------------------------------------------------------------
OFFSET = OUT 11.765 ns AFTER COMP "clk"           | 11.765ns  | 11.491ns | 1
--------------------------------------------------------------------------------
OFFSET = IN 11.765 ns BEFORE COMP "clk"           | 11.765ns  | 11.442ns | 2
--------------------------------------------------------------------------------

Post layout timing report
Timing summary:
---------------
Timing errors: 0  Score: 0
Constraints cover 42912 paths, 0 nets, and 1038 connections

Design statistics:
Minimum period: 11.622ns (Maximum frequency: 86.044MHz)
Minimum input required time before clock: 11.442ns
Minimum output required time after clock: 11.491ns
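The maximum frequency in the post-layout report is the reciprocal of the minimum period, and the PERIOD constraint is met when the actual period does not exceed the requested one. A short check of the figures above (function names are illustrative):

```python
def max_frequency_mhz(min_period_ns: float) -> float:
    """Maximum clock frequency in MHz for a minimum period in nanoseconds."""
    return 1000.0 / min_period_ns


def constraint_met(requested_ns: float, actual_ns: float) -> bool:
    """A period or offset constraint is met when actual <= requested."""
    return actual_ns <= requested_ns


fmax = max_frequency_mhz(11.622)
print(f"Maximum frequency: {fmax:.3f} MHz")
print("PERIOD constraint met:", constraint_met(11.765, 11.622))
```

Compared with the post-synthesis estimate, the placed and routed design meets the requested 11.765 ns period, which is why the timing score is 0.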
Post-place-and-route simulation
• After place-and-route is performed, can do post-place-and-route simulation
– Now have real timing information!
– Also can do static timing analysis: shows the worst case critical path in circuit

Configuration
• Once a design is implemented, you must create a file that the FPGA can understand
– This file is called a bit stream: a BIT file (.bit extension)
• The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information
Configuration of SRAM based FPGAs
The Design Warrior’s Guide to FPGAsDevices, Tools, and Flows. ISBN 0750676043
Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)
System Gates vs. Real Gates
• One common metric used to measure the size of a device in the ASIC world is that of equivalent gates (e-gates)
• Convention used:
  • A 2-input NAND function represents one equivalent gate.
  • An equivalent gate consists of an arbitrary number of transistors.
• Different vendors provide different functions in their cell libraries, where each implementation of each function requires a different number of transistors (difficult to compare capacity/complexity)
• Solution: Assign each function an equivalent gate value and sum all these values.
• How can we establish a basis for comparison between FPGAs and ASICs?
• Can an ASIC of 500,000 equivalent gates that needs to be migrated into an FPGA fit into a particular FPGA?
FPGAs: System Gates
• System Gates: A 4-input LUT can be used to represent anywhere between one and more than twenty 2-input primitive logic gates.
• Rule of thumb?
  • Divide the system gates value by three, so a three million FPGA system gates would equate to one million ASIC equivalent gates!!
• However, to make comparisons between two different implementations on an FPGA (e.g. floating point adder vs. fixed point adder) designers should use the resources available in an FPGA:
  • Number of 4-input LUTs used
  • Number of embedded multipliers
  • Number of embedded RAM blocks
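The divide-by-three rule of thumb can be written down directly, which also gives a rough answer to the migration question about a 500,000 equivalent-gate ASIC. This is only the slide's approximation expressed as code; the function names and the candidate device sizes are hypothetical.

```python
def system_to_asic_gates(system_gates: float) -> float:
    """Rule of thumb from the slide: ASIC equivalent gates = system gates / 3."""
    return system_gates / 3


def asic_fits(asic_equivalent_gates: float, fpga_system_gates: float) -> bool:
    """Rough fit check; real fit depends on LUT, multiplier, and RAM usage."""
    return asic_equivalent_gates <= system_to_asic_gates(fpga_system_gates)


print(system_to_asic_gates(3_000_000))   # three million system gates
print(asic_fits(500_000, 1_000_000))     # hypothetical 1M system-gate device
print(asic_fits(500_000, 1_500_000))     # hypothetical 1.5M system-gate device
```

By this estimate, a 500,000 equivalent-gate design needs a device rated at roughly 1.5 million system gates or more.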
State-of-the-Art FPGAs
• 65-90 nm process on 300 mm wafers
  • Lower cost per function (LUT + register)
  • Smaller and faster transistors: Higher speed
• System speed up to 500 MHz
  • Mainly through smart interconnects, clock management, dedicated circuits, flexible I/O.
  • Integrated transceivers running at 10 Gigabits/sec
• More Logic and Better Features:
  • >100,000 LUTs & flip-flops
  • >200 embedded RAMs, and same number 18 x 18 multipliers
  • 1156 pins (balls) with >800 GP I/O
  • 50 I/O standards, incl. LVDS with internal termination
  • 16 low-skew global clock lines
  • Multiple clock management circuits
  • On-chip microprocessor(s) and multi-Gbps transceivers
Latest Devices: Capacity & Features

Xilinx Virtex-5
• 65nm process
• Up to 960 I/Os
• >200000 logic cells
• Up to 552 18kb block RAMs (~10Mb RAM)
• 450 DSP slices (18x18 multiplier-accumulator)
• 20 digital clock managers (DCM)
• 24 high-speed serial transceivers (622Mb/s to 11.1Gb/s)
• Up to four PowerPC 405 cores

Altera Stratix-II
• 90nm process
• Up to 1170 I/Os
• 179000 logic elements
• 9.6Mb embedded RAM
• 96 DSP blocks: 380 18x18 multipliers
• 12 PLLs
• Serial I/O up to 1Gb/s
• No hard processor cores
FPGAs Becoming More Attractive
[Figure: capacity, speed, and price by year, 1/91 to 1/99. Source: Xilinx]
• Capacity: 21 X Bigger
• Speed: 5.5 X Faster
• Price: 50 X Less Expensive
FPGA Shortcomings
• Circuit Delay
  • Delay increases due to programmable switches in the FPGA routing architecture
• Area
  • Configuration cells and programmable resources incur substantial area penalty
• Power
  • Typically not suited for low power applications
[Figure: ASIC vs. FPGA compared on Performance, Cost, and Time to market; FPGA annotated "Need to improve"]
Conclusion
• FPGAs are the main enabler of Reconfigurable Computing Systems
• FPGAs fill the gap between Instruction Set Processors (GPs) and ASICs.
– Advantages: Flexible, programmable
– Disadvantages: Power dissipation, performance w.r.t. ASIC
• Applicability of FPGAs relies on CAD tools provided by different vendors such as Xilinx and Altera
• RCS can be realized with several technologies:
– FPGAs: Fine/Medium Grain
– Coarse Grain Reconfigurable Architectures: CGRAs