George Mason University
FPGA Devices & FPGA Design Flow
ECE 545 Lecture 8
Required Reading
Xilinx, Inc., Spartan-3 FPGA Family Data Sheet
Module 1:
• Introduction
• Features
• Architectural Overview
• Package Marking
Module 2:
• CLB Overview
Required Reading
Xilinx, Inc., Spartan-3 Generation FPGA User Guide
• Chapter 5: Using Configurable Logic Blocks (CLBs)
• Chapter 6: Using Look-Up Tables as Distributed RAM
• Chapter 7: Using Look-Up Tables as Shift Registers (SRL16)
• Chapter 9: Using Carry and Arithmetic Logic
Two competing implementation approaches
ASIC (Application Specific Integrated Circuit)
• designed all the way from behavioral description to physical layout
• designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry
FPGA (Field Programmable Gate Array)
• no physical layout design; design ends with a bitstream used to configure a device
• bought off the shelf and reconfigured by designers themselves
What is an FPGA?
[Figure: FPGA layout showing Configurable Logic Blocks, I/O Blocks, and Block RAMs]
Which Way to Go?
ASICs
• High performance
• Low power
• Low cost in high volumes
FPGAs
• Off-the-shelf
• Low development cost
• Short time to market
• Reconfigurability
Other FPGA Advantages
• Manufacturing cycle for an ASIC is very costly, lengthy, and engages lots of manpower
• Mistakes not detected at design time have a large impact on development time and cost
• FPGAs are perfect for rapid prototyping of digital circuits
• Easy upgrades, as in the case of software
• Unique applications
  • reconfigurable computing
Major FPGA Vendors
SRAM-based FPGAs
• Xilinx, Inc.
• Altera Corp.
• Atmel
• Lattice Semiconductor
(Xilinx and Altera together share about 85% of the market)
Flash & antifuse FPGAs
• Microsemi SoC Products Group (formerly Actel Corp.)
• Quick Logic Corp.
Xilinx
• Primary products: FPGAs and the associated CAD software
  • Programmable Logic Devices
  • ISE Alliance and Foundation Series Design Software
• Main headquarters in San Jose, CA
• Fabless* Semiconductor and Software Company
  • UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
  • Seiko Epson (Japan)
  • TSMC (Taiwan)
  • Samsung (Korea)
Xilinx FPGA Families
• Old families
  • XC3000, XC4000, XC5200
  • Old 0.5µm, 0.35µm and 0.25µm technology. Not recommended for modern designs.
• High-performance families
  • Virtex (220 nm)
  • Virtex-E, Virtex-EM (180 nm)
  • Virtex-II (130 nm)
  • Virtex-II PRO (130 nm)
  • Virtex-4 (90 nm)
  • Virtex-5 (65 nm)
  • Virtex-6 (40 nm)
• Low Cost Family
  • Spartan/XL – derived from XC4000
  • Spartan-II – derived from Virtex
  • Spartan-IIE – derived from Virtex-E
  • Spartan-3 (90 nm)
  • Spartan-3E (90 nm) – logic optimized
  • Spartan-3A (90 nm) – I/O optimized
  • Spartan-3AN (90 nm) – non-volatile
  • Spartan-3A DSP (90 nm) – DSP optimized
  • Spartan-6 (45 nm)
CLB Structure
The Design Warrior’s Guide to FPGAs Devices, Tools, and Flows. ISBN 0750676043
Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)
General structure of an FPGA
Xilinx Spartan 3 CLB
[Figure: Spartan-3 CLB slice with two look-up tables (inputs G4 to G1 and F4 to F1), carry and control logic, and two storage elements; slice signals include CIN, COUT, CLK, CE, SR, BY, F5IN, X, XB, Y, YB]
CLB Slice = 2 Logic Cells
Xilinx Multipurpose LUT (MLUT)
16 x 1 ROM (logic)
Spartan 3 CLB Structure
CLB Slice Structure
• Each slice contains two sets of the following:
  • Four-input LUT
    • Any 4-input logic function (16x1 ROM),
    • or 16-bit x 1 sync RAM (SLICEM only)
    • or 16-bit shift register (SLICEM only)
  • Carry & Control
    • Fast arithmetic logic
    • Multiplier logic
    • Multiplexer logic
  • Storage element
    • Latch or flip-flop
    • Set and reset
    • True or inverted inputs
    • Sync. or async. control
Multipurpose Look-Up Table (MLUT)
[Figure: CLB slice diagram]
MLUT as 16x1 ROM
LUT (Look-Up Table) Functionality
• Look-Up tables are primary elements for logic implementation
• Each LUT can implement any function of 4 inputs
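To make the "16x1 ROM" view concrete, here is a minimal behavioral sketch of a 4-input LUT. The entity name `lut4_rom`, the port names, and the contents `x"6996"` (4-input XOR) are illustrative assumptions, not taken from the Xilinx libraries.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: a 4-input LUT modeled as a 16 x 1 ROM.
entity lut4_rom is
  port (
    a : in  std_logic_vector(3 downto 0);  -- the four LUT inputs
    o : out std_logic                      -- the LUT output
  );
end entity lut4_rom;

architecture behavioral of lut4_rom is
  -- ROM contents: bit i holds the function value for input value i.
  -- x"6996" is 1 exactly when an odd number of inputs is 1 (4-input XOR).
  constant INIT : std_logic_vector(15 downto 0) := x"6996";
begin
  -- The inputs act as the ROM address.
  o <= INIT(to_integer(unsigned(a)));
end architecture behavioral;
```

Changing the 16-bit constant changes the function, which is why one LUT can realize any function of 4 inputs.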
5-Input Functions implemented using two LUTs
• One CLB Slice can implement any function of 5 inputs
• Logic function is partitioned between two LUTs
• F5 multiplexer selects LUT
5-Input Functions implemented using two LUTs
[Figure: outputs of LUT pairs combined by multiplexers to produce OUT]
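The partitioning can be sketched behaviorally as two 16x1 tables and a 2-to-1 multiplexer. This is only a model of the idea; the entity name `func5`, the signal names, and the table contents (5-input XOR) are assumptions chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: a 5-input function split across two 4-input LUTs.
entity func5 is
  port (
    a : in  std_logic_vector(4 downto 0);
    y : out std_logic
  );
end entity func5;

architecture behavioral of func5 is
  -- Table used when a(4) = '0': XOR of a(3 downto 0).
  constant INIT_F : std_logic_vector(15 downto 0) := x"6996";
  -- Table used when a(4) = '1': the complement of INIT_F.
  constant INIT_G : std_logic_vector(15 downto 0) := x"9669";
  signal f_out, g_out : std_logic;
begin
  f_out <= INIT_F(to_integer(unsigned(a(3 downto 0))));
  g_out <= INIT_G(to_integer(unsigned(a(3 downto 0))));
  -- The fifth input plays the role of the F5 multiplexer select.
  y <= g_out when a(4) = '1' else f_out;
end architecture behavioral;
```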
MLUT as 16x1 RAM
Distributed RAM
[Figure: distributed RAM primitives built from LUTs: RAM16X1S (D, WE, WCLK, A0 to A3, O); RAM32X1S (D, WE, WCLK, A0 to A4, O); RAM16X2S (D0, D1, WE, WCLK, A0 to A3, O0, O1); RAM16X1D (D, WE, WCLK, A0 to A3, DPRA0 to DPRA3, SPO, DPO)]
• CLB LUT configurable as Distributed RAM
  • A single LUT equals 16x1 RAM
  • Two LUTs implement Single and Dual-Port RAMs
  • Cascade LUTs to increase RAM size
• Synchronous write
• Synchronous/Asynchronous read
  • Accompanying flip-flops used for synchronous read
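As a rough illustration of the behavior listed above, the following sketch describes a 16x1 single-port memory with a synchronous write and an asynchronous read. The entity and port names are assumptions modeled loosely on the RAM16X1S pin names; whether a given synthesis tool maps this description onto LUT RAM depends on the tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: behavioral 16 x 1 single-port RAM.
entity dist_ram16x1 is
  port (
    wclk : in  std_logic;                     -- write clock
    we   : in  std_logic;                     -- write enable
    d    : in  std_logic;                     -- data in
    a    : in  std_logic_vector(3 downto 0);  -- address
    o    : out std_logic                      -- data out
  );
end entity dist_ram16x1;

architecture behavioral of dist_ram16x1 is
  type ram_t is array (0 to 15) of std_logic;
  signal ram : ram_t := (others => '0');
begin
  -- Synchronous write.
  process (wclk)
  begin
    if rising_edge(wclk) then
      if we = '1' then
        ram(to_integer(unsigned(a))) <= d;
      end if;
    end if;
  end process;

  -- Asynchronous read: the output follows the address with no clock.
  o <= ram(to_integer(unsigned(a)));
end architecture behavioral;
```

Registering `o` in a clocked process would give the synchronous-read variant that uses the accompanying flip-flop.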
MLUT as 16-bit Shift Register (SRL16)
Shift Register
[Figure: LUT drawn as a chain of flip-flops (D, Q, CE) with inputs IN, CE, CLK, DEPTH[3:0] and output OUT]
• Each LUT can be configured as shift register
  • Serial in, serial out
• Dynamically addressable delay up to 16 cycles
• For programmable pipeline
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth
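A behavioral sketch of the dynamically addressable mode is shown below. The entity name `srl16_dyn` and its port names are illustrative assumptions; the description has no set or reset, which matches the conditions under which tools typically map a shift register onto a LUT.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: 16-bit shift register with a selectable tap.
entity srl16_dyn is
  port (
    clk   : in  std_logic;
    ce    : in  std_logic;                     -- clock enable
    d     : in  std_logic;                     -- serial input
    depth : in  std_logic_vector(3 downto 0);  -- tap select
    q     : out std_logic                      -- serial output
  );
end entity srl16_dyn;

architecture behavioral of srl16_dyn is
  signal sr : std_logic_vector(15 downto 0) := (others => '0');
begin
  -- Shift on every enabled clock edge; no set or reset.
  process (clk)
  begin
    if rising_edge(clk) then
      if ce = '1' then
        sr <= sr(14 downto 0) & d;
      end if;
    end if;
  end process;

  -- Tap k returns the input delayed by k + 1 enabled clock cycles,
  -- so depth = "1111" gives the maximum delay of 16 cycles.
  q <= sr(to_integer(unsigned(depth)));
end architecture behavioral;
```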
Using Multipurpose Look-Up Tables in the Shift Register Mode (SRL16)
ECE 448 – FPGA and ASIC Design with VHDL
Inferred from behavioral description in VHDL for shift registers with
• one serial input, one serial output
• no reset, no set
Cascading LUT Shift Registers into Shift Registers Longer than 16 bits
Shift Register
• Register-rich FPGA
  • Allows for addition of pipeline stages to increase throughput
• Data paths must be balanced to keep desired functionality
[Figure: two parallel 64-bit data paths. Operation A (4 cycles) followed by Operation B (8 cycles) takes 12 cycles; Operation C takes 3 cycles, leaving a 9-cycle imbalance]
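One way to remove such an imbalance is to delay the shorter path by the missing number of cycles. The sketch below is a generic delay line with defaults matching the 64-bit, 9-cycle case in the figure; the entity name `delay_line` and its generics are assumptions. It has no reset, so a tool may choose to implement each bit as a LUT-based shift register.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: delays a WIDTH-bit bus by DEPTH clock cycles.
entity delay_line is
  generic (
    WIDTH : positive := 64;  -- data path width from the figure
    DEPTH : positive := 9    -- imbalance to compensate, in cycles
  );
  port (
    clk  : in  std_logic;
    din  : in  std_logic_vector(WIDTH-1 downto 0);
    dout : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity delay_line;

architecture behavioral of delay_line is
  type pipe_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);
  signal pipe : pipe_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      pipe(0) <= din;
      -- Each stage takes the value of the previous stage.
      for i in 1 to DEPTH-1 loop
        pipe(i) <= pipe(i-1);
      end loop;
    end if;
  end process;

  -- The last stage holds the input from DEPTH cycles earlier.
  dout <= pipe(DEPTH-1);
end architecture behavioral;
```

Placed after Operation C, this would bring the 3-cycle path up to the 12 cycles of the A-then-B path.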
Carry & Control Logic
[Figure: CLB slice diagram]

Full-adder
[Figure: full-adder block FA with inputs x, y, cin and outputs cout, s]
x + y + cin = (cout s)₂, with weights cout = 2 and s = 1

x  y  cin   cout  s
0  0  0     0     0
0  0  1     0     1
0  1  0     0     1
0  1  1     1     0
1  0  0     0     1
1  0  1     1     0
1  1  0     1     0
1  1  1     1     1

Full-adder: Alternative implementations

x  y   cout  s
0  0   0     cin
0  1   cin   not cin
1  0   cin   not cin
1  1   1     cin

[Figure: an XOR gate produces p from x and y; a 2-to-1 multiplexer controlled by p selects g (input 0) or Cin (input 1) as Cout; a second XOR of p and Cin produces S]

Full-adder: Alternative implementations
Implementation used to generate fast carry logic in Xilinx FPGAs

x  y   cout
0  0   y
0  1   cin
1  0   cin
1  1   y

p = x ⊕ y
g = y
s = p ⊕ cin = x ⊕ y ⊕ cin
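The equations above translate directly into a small dataflow description. This is a sketch of the logic only, with an assumed entity name `fa_carry`; it does not instantiate the actual Xilinx carry primitives.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: full adder in propagate/generate form.
entity fa_carry is
  port (
    x, y, cin : in  std_logic;
    s, cout   : out std_logic
  );
end entity fa_carry;

architecture dataflow of fa_carry is
  signal p, g : std_logic;
begin
  p <= x xor y;    -- propagate
  g <= y;          -- generate value used when p = '0' (then x = y)
  s <= p xor cin;  -- sum
  -- Carry multiplexer: pass cin when propagating, otherwise output g.
  cout <= cin when p = '1' else g;
end architecture dataflow;
```

When p = '0' both operands are equal, so y alone tells whether a carry is produced; that is why g can be taken directly from an operand instead of an AND gate.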
Carry & Control Logic in Spartan 3 FPGAs
[Figure: LUT followed by hardwired (fast) logic]
Simplified View of Spartan-3 FPGA Carry and Arithmetic Logic in One Logic Cell
Simplified View of Carry Logic in One Spartan 3 Slice
Critical Path for an Adder Implemented Using Xilinx Spartan 3/Spartan 3E FPGAs
Number and Length of Carry Chains for Spartan 3 FPGAs
Bottom Operand Input to Carry Out Delay, TOPCYF: 0.9 ns for Spartan 3
Carry Propagation Delay, tBYP: 0.2 ns for Spartan 3
Carry Input to Top Sum Combinational Output Delay, TCINY: 1.2 ns for Spartan 3
Critical Path Delays and Maximum Clock Frequencies (taking into account surrounding registers)
Fast Carry Logic
• Each CLB contains separate logic and routing for the fast generation of sum & carry signals
  • Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
• Carry logic is independent of normal logic and routing resources
[Figure: carry logic routing running from LSB to MSB]
Accessing Carry Logic
All major synthesis tools can infer carry logic for arithmetic functions
• Addition (SUM <= A + B)
• Subtraction (DIFF <= A - B)
• Comparators (if A < B then…)
• Counters (count <= count + 1)
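The four cases can be written with `numeric_std` arithmetic as in the sketch below. The entity name `carry_examples` and the generic `N` are assumptions; whether the carry chain is actually used is decided by the synthesis tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: arithmetic operators that tools usually map
-- onto dedicated carry logic.
entity carry_examples is
  generic (N : positive := 8);
  port (
    clk    : in  std_logic;
    a, b   : in  std_logic_vector(N-1 downto 0);
    sum    : out std_logic_vector(N-1 downto 0);
    diff   : out std_logic_vector(N-1 downto 0);
    a_lt_b : out std_logic;
    count  : out std_logic_vector(N-1 downto 0)
  );
end entity carry_examples;

architecture rtl of carry_examples is
  signal count_r : unsigned(N-1 downto 0) := (others => '0');
begin
  sum    <= std_logic_vector(unsigned(a) + unsigned(b));       -- addition
  diff   <= std_logic_vector(unsigned(a) - unsigned(b));       -- subtraction
  a_lt_b <= '1' when unsigned(a) < unsigned(b) else '0';       -- comparator

  -- Free-running counter; wraps around after 2**N - 1.
  process (clk)
  begin
    if rising_edge(clk) then
      count_r <= count_r + 1;
    end if;
  end process;

  count <= std_logic_vector(count_r);
end architecture rtl;
```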
Logic Cell = ½ of a CLB Slice
CLB Slice = 2 Logic Cells
Examples:
Determine the amount of Spartan 3 resources needed to implement a given circuit
Circuit 1: Top level
[Figure: registers R0 to R15, function block F (inputs a, b, c, d; output y), and signals w, m, clk, run]
Circuit 1: F – function
[Figure: components include a 2-to-4 decoder with enable (inputs w1, w0, En; outputs y3 to y0), a full adder (x, y, cin; cout, s), a rotate-left-by-3 unit (<<<3), and an 8-input multiplexer]
Circuit 2: Top level
[Figure: registers R0 to R15, function block F (inputs a, b, c, d, e; output y), and signals z, d, clk, run]
Circuit 2: F – function
[Figure: components include a priority encoder (inputs w3 to w0; outputs y1, y0, z), a half adder (x, y; cout, s), a shift-right-by-2 unit (>>2), and an 8-input multiplexer]
Circuit 3: Top level
Circuit 4: Top level
Other Components of Spartan 3 FPGAs
RAM Blocks and Multipliers in Xilinx FPGAs
Combinational and Registered Multiplier
Dedicated Multiplier Block
Block RAM
[Figure: Spartan-3 dual-port block RAM with Port A and Port B]
• Most efficient memory implementation
  • Dedicated blocks of memory
• Ideal for most memory requirements
  • 4 to 36 memory blocks in Spartan 3
  • 18 kbits = 18,432 bits per block (16 kbits without parity bits)
  • Use multiple blocks for larger memories
• Builds both single and true dual-port RAMs
• Synchronous write and read (different from distributed RAM)
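For contrast with distributed RAM, here is a behavioral sketch of a single-port memory in which both the write and the read are clocked. The entity name `bram_2k_x9`, the port names, and the 2k x 9 organization are illustrative assumptions; mapping onto a block RAM is up to the synthesis tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: 2048 x 9 memory with synchronous write and read.
entity bram_2k_x9 is
  port (
    clk  : in  std_logic;
    en   : in  std_logic;                      -- port enable
    we   : in  std_logic;                      -- write enable
    addr : in  std_logic_vector(10 downto 0);  -- 2**11 = 2048 locations
    din  : in  std_logic_vector(8 downto 0);   -- 8 data bits + 1 parity bit
    dout : out std_logic_vector(8 downto 0)
  );
end entity bram_2k_x9;

architecture behavioral of bram_2k_x9 is
  type ram_t is array (0 to 2047) of std_logic_vector(8 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if en = '1' then
        if we = '1' then
          ram(to_integer(unsigned(addr))) <= din;
        end if;
        -- Registered read: dout changes only on a clock edge and
        -- returns the value stored before any write in this cycle.
        dout <= ram(to_integer(unsigned(addr)));
      end if;
    end if;
  end process;
end architecture behavioral;
```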
Block RAM can have various configurations (port aspect ratios)

Configuration    Address range   Data width (bits)
16k x 1          0 to 16,383     1
8k x 2           0 to 8,191      2
4k x 4           0 to 4,095      4
2k x (8+1)       0 to 2,047      8+1
1024 x (16+2)    0 to 1,023      16+2
Block RAM Port Aspect Ratios
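As a quick check on the configurations 16k x 1, 8k x 2, 4k x 4, 2k x (8+1), and 1024 x (16+2), multiplying depth by width shows that the narrow ones use only the 16,384 data bits, while the ones that include parity use all 18,432 bits of a block:

```latex
\begin{align*}
16\mathrm{k} \times 1      &= 16384 \times 1  = 16384 \text{ bits} \\
8\mathrm{k} \times 2       &= 8192 \times 2   = 16384 \text{ bits} \\
4\mathrm{k} \times 4       &= 4096 \times 4   = 16384 \text{ bits} \\
2\mathrm{k} \times (8+1)   &= 2048 \times 9   = 18432 \text{ bits} \\
1024 \times (16+2)         &= 1024 \times 18  = 18432 \text{ bits}
\end{align*}
```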
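Single-Port Block RAM
[Figure: single-port block RAM with data ports DI[w-p-1:0] and DO[w-p-1:0]]
Dual-Port Block RAM
[Figure: dual-port block RAM with data ports DIA[wA-pA-1:0] and DOA[wA-pA-1:0] on port A, and DIB[wB-pB-1:0] and DOB[wB-pB-1:0] on port B]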
Input/Output Blocks (IOBs)
Basic I/O Block Structure
[Figure: I/O block with three flip-flops (D, EC, Q, SR), one each on the three-state control path, the output path, and the input path; signals: Three-State, Output, Clock, Set/Reset, Direct Input, Registered Input, and one FF Enable per flip-flop]
IOB Functionality
IOB Functionality
• IOB provides interface between the package pins and CLBs
• Each IOB can work as uni- or bi-directional I/O
• Outputs can be forced into High Impedance
• Inputs and outputs can be registered
  • advised for high-performance I/O
• Inputs can be delayed
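The bidirectional and registered behavior can be described in VHDL roughly as follows. The entity name `iob_bidir` and all port names are assumptions; whether the three registers are actually packed into the IOB depends on the tool and its constraints.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: registered bidirectional pin.
entity iob_bidir is
  port (
    clk   : in    std_logic;
    oe    : in    std_logic;  -- '1' drives the pin, '0' releases it
    dout  : in    std_logic;  -- value to drive onto the pin
    din_q : out   std_logic;  -- registered value read from the pin
    pad   : inout std_logic   -- the package pin
  );
end entity iob_bidir;

architecture rtl of iob_bidir is
  signal dout_q, oe_q : std_logic;
begin
  -- Output, three-state control, and input are all registered.
  process (clk)
  begin
    if rising_edge(clk) then
      dout_q <= dout;
      oe_q   <= oe;
      din_q  <= pad;
    end if;
  end process;

  -- Force the output into high impedance when not enabled.
  pad <= dout_q when oe_q = '1' else 'Z';
end architecture rtl;
```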
Spartan-3 Family Attributes
Spartan-3 FPGA Family Members
FPGA Nomenclature
FPGA Nomenclature Example
XC3S1500-4FG320
• XC3S: Spartan 3 family
• 1500: 1500 k = 1.5 M equivalent logic gates
• -4: speed grade (-4 = standard performance)
• FG: package type
• 320: 320 pins
FPGA Design Flow
FPGA Design process (1)
Specification / Pseudocode
Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in experiment 5, this time your unit has to be able to perform an encryption algorithm by itself, executing 32 rounds…..
On-paper hardware design (Block diagram & ASM chart)
VHDL description (Your Source Files)
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity RC5_core is
  port (
    clock, reset, encr_decr : in  std_logic;
    data_input  : in  std_logic_vector(31 downto 0);
    data_output : out std_logic_vector(31 downto 0);
    out_full    : in  std_logic;
    key_input   : in  std_logic_vector(31 downto 0);
    key_read    : out std_logic
  );
end RC5_core;
Functional simulation
Synthesis
Post-synthesis simulation
FPGA Design process (2)
Implementation
Configuration
Timing simulation
On chip testing
Tools used in FPGA Design Flow
• Design: produces functionally verified VHDL code
• Synthesis (Xilinx XST or Synplify Premier): VHDL code → Netlist
• Implementation (Xilinx ISE): Netlist → Bitstream
Synthesis
Synthesis Tools
• Synplify Premier
• Xilinx XST
• … and others
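The architecture that follows uses ports that the slide does not declare. A minimal sketch of a matching entity, assuming every port is a single STD_LOGIC bit (an assumption inferred only from how the names are used in the architecture), might look like this:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Assumed entity for the MLU_DATAFLOW architecture; the port list is
-- inferred from the signal names and is not given in the original slide.
entity MLU is
  port (
    NEG_A : in  STD_LOGIC;  -- '1' inverts operand A
    NEG_B : in  STD_LOGIC;  -- '1' inverts operand B
    NEG_Y : in  STD_LOGIC;  -- '1' inverts the result
    A     : in  STD_LOGIC;
    B     : in  STD_LOGIC;
    L1    : in  STD_LOGIC;  -- operation select, high bit
    L0    : in  STD_LOGIC;  -- operation select, low bit
    Y     : out STD_LOGIC
  );
end MLU;
```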
architecture MLU_DATAFLOW of MLU is
  signal A1 : STD_LOGIC;
  signal B1 : STD_LOGIC;
  signal Y1 : STD_LOGIC;
  signal MUX_0, MUX_1, MUX_2, MUX_3 : STD_LOGIC;
begin
  A1 <= A when (NEG_A = '0') else not A;
  B1 <= B when (NEG_B = '0') else not B;
  Y  <= Y1 when (NEG_Y = '0') else not Y1;

  MUX_0 <= A1 and B1;
  MUX_1 <= A1 or B1;
  MUX_2 <= A1 xor B1;
  MUX_3 <= A1 xnor B1;

  with (L1 & L0) select
    Y1 <= MUX_0 when "00",
          MUX_1 when "01",
          MUX_2 when "10",
          MUX_3 when others;
end MLU_DATAFLOW;
Logic Synthesis: VHDL description → Circuit netlist
Circuit netlist (RTL view)
Mapping
[Figure: incrementer, comparator, and MUX mapped onto LUT0 to LUT5 and flip-flops FF1 and FF2]
Technology view is presented using device primitives
Ports, nets and blocks browser
Pay attention: technology view is usually large and presented on a number of sheets
Implementation
• After synthesis the entire implementation process is performed by FPGA vendor tools
Translation
• Inputs from synthesis: circuit netlist in EDIF (Electronic Design Interchange Format) and NCF (Native Constraint File)
• Timing constraints: UCF (User Constraint File), written with the Constraint Editor or a text editor
• Output: NGD (Native Generic Database file)
Mapping
[Figure: logic mapped onto LUT0 to LUT5 and flip-flops FF1 and FF2]
Placing
[Figure: CLB slices placed on the FPGA]
Routing
[Figure: programmable connections routed on the FPGA]
Configuration
• Once a design is implemented, you must create a file that the FPGA can understand
  • This file is called a bitstream: a BIT file (.bit extension)
• The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information
Report files
Map report header
Release 8.1i Map I.24
Xilinx Mapping Report File for Design 'Lab3Demo'

Design Information
------------------
Command Line   : c:\Xilinx\bin\nt\map.exe -p 3S1500FG320-4 -o map.ncd -pr b -k 4 -cm area -c 100 Lab3Demo.ngd Lab3Demo.pcf
Target Device  : xc3s1500
Target Package : fg320
Target Speed   : -4
Mapper Version : spartan3 -- $Revision: 1.34 $
Mapped Date    : Tue Feb 13 17:04:54 2007

Map report
Design Summary
--------------
Number of errors:      0
Number of warnings:    0
Logic Utilization:
  Number of Slice Flip Flops:          30 out of 26,624    1%
  Number of 4 input LUTs:              38 out of 26,624    1%
Logic Distribution:
  Number of occupied Slices:           33 out of 13,312    1%
  Number of Slices containing only related logic:   33 out of 33   100%
  Number of Slices containing unrelated logic:       0 out of 33     0%
  *See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs:             62 out of 26,624    1%
  Number used as logic:                38
  Number used as a route-thru:         24
Number of bonded IOBs:                 10 out of 221       4%
  IOB Flip Flops:                       7
Number of GCLKs:                        1 out of 8        12%
Related and Unrelated Logic Related logic is defined as being logic that shares connectivity – e.g. two LUTs are "related" if they share common inputs. When assembling slices, Map gives priority to combine logic that is related. Doing so results in the best timing performance.
Unrelated logic shares no connectivity. Map will only begin packing unrelated logic into a slice once 99% of the slices are occupied through related logic packing.
Note that once logic distribution reaches the 99% level through related logic packing, this does not mean the device is completely utilized. Unrelated logic packing will then begin, continuing until all usable LUTs and FFs are occupied. Depending on your timing budget, increased levels of unrelated logic packing may adversely affect the overall timing performance of your design.
Place & route report
Asterisk (*) preceding a constraint indicates it was not met. This may be due to a setup or hold violation.

Constraint: * TS_CLOCK = PERIOD TIMEGRP "CLOCK" 5 ns HIGH 50%
  Requested: 5.000ns   Actual: 5.140ns   Logic Levels: 4   Absolute Slack: -0.140ns   Number of errors: 5
Constraint: TS_gen1Hz_Clock1Hz = PERIOD TIMEGRP "gen1Hz_Clock1Hz" 5 ns HIGH 50%
  Requested: 5.000ns   Actual: 4.137ns   Logic Levels: 2   Absolute Slack: 0.863ns    Number of errors: 0
Post layout timing report
Clock to Setup on destination clock CLOCK
---------------+---------+---------+---------+---------+
               | Src:Rise| Src:Fall| Src:Rise| Src:Fall|
Source Clock   |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall|
---------------+---------+---------+---------+---------+
CLOCK          |    5.140|         |         |         |
---------------+---------+---------+---------+---------+

Timing summary:
---------------
Timing errors: 9  Score: 543
Constraints cover 574 paths, 0 nets, and 187 connections
Design statistics:
  Minimum period: 5.140ns (Maximum frequency: 194.553MHz)
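The reported figures are related by simple arithmetic, using only values from the reports (requested period 5.000 ns, minimum period 5.140 ns). The symbol names below are notation introduced here for illustration:

```latex
\begin{align*}
f_{\max} &= \frac{1}{T_{\min}} = \frac{1}{5.140\ \text{ns}} \approx 194.553\ \text{MHz} \\
\text{slack} &= T_{\text{requested}} - T_{\min} = 5.000\ \text{ns} - 5.140\ \text{ns} = -0.140\ \text{ns}
\end{align*}
```

The negative slack is why the TS_CLOCK constraint is marked with an asterisk: the requested 5 ns period corresponds to 200 MHz, which the design misses.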
Xilinx FPGA Devices

Technology    Low-cost     High-performance
120/150 nm    -            Virtex 2, 2 Pro
90 nm         Spartan 3    Virtex 4
65 nm         -            Virtex 5
45 nm         Spartan 6    -
40 nm         -            Virtex 6

Altera FPGA Devices

Technology    Low-cost       Mid-range    High-performance
130 nm        Cyclone        -            Stratix
90 nm         Cyclone II     -            Stratix II
65 nm         Cyclone III    Arria I      Stratix III
40 nm         Cyclone IV     Arria II     Stratix IV