L2: FPGA HARDWARE - 18-545: Advanced Digital … (ece545.com/F15/slides/L02_FPGA_Hardware.pdf)
18-545: FALL 2014
Admin stuff
Project Proposals happen on Monday
  Be prepared to give an in-class presentation
Lab 1 is due Wednesday, Sept. 16th
Reading Assignment #1 due today
  Submit a PDF/text file, don't fill in the web form
Team assignments are done
Admin Stuff
Status reports due today
  No word docs, please!
  Be specific about what happened/is going to happen
  Talk about what YOU did/will do, not just what your group did
  Grades on the way, as general feedback
Game Plan
Overview
Why use FPGAs?
FPGA Internals
Caveat: I will use Xilinx-specific terminology since that’s the FPGA company you will be using. Beware that other companies use different terms.
FPGA Overview
Field Programmable Gate Array
  Array of generic logic gates
  Gates where logic function can be programmed
  Programmable interconnection between gates
  Fielded systems can be programmed
    i.e. post-fabrication
Design Platform
Virtex-5 Development System
  Xilinx XC5VLX110T FPGA
    17280 slices of CLB goodness
  256MB DDR2 (SODIMM)
  DVI Video port
    VGA port is for input
  10/100/1000 Ethernet port
  Audio Codec (AC97)
  USB2 port
  16x2 LCD, RS-232
  Compact Flash card slot
  Expansion connectors
Why use FPGAs?
System designers have a Goldilocks problem
  Off-the-shelf parts are not efficient enough
  Custom ASICs cost too much
  Need a “just right” solution
ASIC Design
Difficult to design
  Large and complex
  Issues in advanced processes
    Interconnect delay
    Device leakage
    Power density constraints
Expensive to design / fabricate
  Mask set costs
  Non-recurring engineering costs
Need a high-volume, high-profit market to justify costs!
Efficiency View
An efficiency gap exists between ASICs and CPUs
[Figure: Energy Efficiency (MOPS/mW) and Area Efficiency (MOPS/mm2) on a log axis from 0.01 to 10000, for x-axis entries 1 to 20 grouped as Microprocessors, DSPs, and ASICs]
N. Zhang, et al., “The Cost of Flexibility in Systems on a Chip Design for Signal Processing Applications”
Economic View
FPGAs: High package costs ($300+), low NRE costs
ASICs: Low package costs (pennies), high NRE costs ($600K+)
[Figure: Development Cost + Device Cost versus Total Units, with an FPGA Trend line and an ASIC Trend line that cross (Courtesy Xilinx, Inc.)]
  FPGA solution has a lower total cost (below the crossover volume)
  ASIC solution has a lower total cost (above the crossover volume)
  Decreasing FPGA unit cost pushing crossover point to the right
Additional ASIC costs:
  •Increasing NRE charge
  •58% are late to market -- impacts total volumes shipped
  •ASIC cycle longer than some market windows
  •Over 50% need to be respun
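The crossover point in the Economic View can be estimated with a little arithmetic. The sketch below is only an illustration: the $300 FPGA part cost and $600K ASIC NRE come from the slide, while the $0 FPGA NRE and $0.50 ASIC part cost are assumed placeholders, and the function names are made up for this example.

```python
import math

def total_cost(nre, unit_cost, units):
    """Development (NRE) cost plus per-device cost for a production run."""
    return nre + unit_cost * units

def crossover_units(fpga_nre, fpga_unit, asic_nre, asic_unit):
    """Smallest volume at which the ASIC total cost is no higher than the FPGA's."""
    if fpga_unit <= asic_unit:
        raise ValueError("no crossover: FPGA unit cost must exceed ASIC unit cost")
    return max(0, math.ceil((asic_nre - fpga_nre) / (fpga_unit - asic_unit)))

# Slide figures: $300 FPGA part, $600K ASIC NRE.
# Assumed for illustration only: $0 FPGA NRE, $0.50 ASIC part.
n = crossover_units(fpga_nre=0, fpga_unit=300.0, asic_nre=600_000, asic_unit=0.50)
print(n)
```

Lowering `fpga_unit` in this model raises the result, which is the "pushing crossover point to the right" effect noted on the slide.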
FPGA Advantages
Higher performance than CPU solution
Lower power than CPU solution (usually)
Low NRE costs
Off-the-shelf part designed by FPGA vendor
You are sharing NRE costs with all other customers
Fast design time
Low time-to-market
Fast re-design / re-fabrication time
Easy to correct an error, to add functionality, in response to spec change
Can even change product after deployment
FPGA Disadvantages
High per-part costs
  Good for low to middle volume applications
  High volume applications should consider ASICs
    Perhaps use FPGA for prototyping
Lower performance than ASIC
Higher power than ASIC
More specialized design skills than programming a CPU
Example uses of FPGAs
Rapid Prototyping
  Emulation of ASIC design
  Design exploration
  Verification
Shipping product
  Networking
  Military
  Microsoft Bing Datacenters
Reconfigurable Computing
FPGA Breakdown
3 Basic components
  Configurable Logic Blocks
  General purpose interconnect
  I/O Blocks
Advanced components
  Hard macros
    CPUs
    Block RAM
    Multipliers
  Specialized components
VIRTEX-II PRO
XILINX XC3020
  CLB (64 TOTAL)
  I/O BLOCK (64 TOTAL)
  GENERAL PURPOSE INTERCONNECT
  SWITCH MATRIX
  IOBS HAVE DIRECT ACCESS TO ADJACENT CLBS
  (COURTESY XILINX, INC.)
ZOOMED IN VIEW OF THE CLB MATRIX OF THE FPGA
  SPECIFIC INGRESS AND EGRESS CONNECTION OPTIONS (BLACK DOTS) ARE AVAILABLE
EVEN MORE ZOOMED IN VIEW
  (COURTESY XILINX, INC.)
ROUTING
  ONLY CERTAIN CONNECTION PATTERNS ARE POSSIBLE
  (COURTESY XILINX, INC.)
ROUTING: THE SWITCH MATRIX
  EACH MATRIX HAS 5 CONNECTIONS PER SIDE
Hierarchical Routing
Spartan-2 and more recent have different length connections between switch matrices
  Local roads, limited access roads, interstate highways
  Routes across entire chip don’t burn lots of short connections
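To see why mixed wire lengths help, here is a rough sketch that counts how many wire segments a route consumes when longer segments are available. The segment lengths and the greedy strategy are hypothetical choices for illustration, not the actual Spartan-2 routing resources or router.

```python
def segments_used(distance, lengths=(1,)):
    """Greedy count of wire segments needed to span `distance` switch matrices."""
    if 1 not in lengths:
        raise ValueError("lengths must include 1 so every distance is reachable")
    count = 0
    # Use the longest wires first, then fill the remainder with shorter ones.
    for length in sorted(lengths, reverse=True):
        count += distance // length
        distance %= length
    return count

flat = segments_used(20)                     # only single-length wires
tiered = segments_used(20, lengths=(1, 6))   # add a hypothetical 6-long wire
print(flat, tiered)
```

In this toy model every segment also implies a pass through a switch matrix, so fewer segments means fewer programmable switches on the path.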
Configurable Logic Blocks
CLBs get more and more stuff crammed in them over time
XC3K family had LUT (5 variable input, 2 FF values, 2 outputs), 2 FFs, clock enable, FF reset (direct / global) and 9 muxes
~51 bits of configuration SRAM per CLB
(COURTESY XILINX, INC.)
What’s a Look-up-table (LUT)?
A direct implementation of a truth table, using memory
  LUT inputs are memory address values
  LUT outputs are the memory data value
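[Figure: LUT block with inputs A, B, C, D and output F]

[Figure: logic with inputs A, B and output F]
A B C D | F
0 0 0 0 | 1
0 0 0 1 | 1
0 0 1 0 | 1
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
0 1 1 0 | 1
0 1 1 1 | 1
1 0 0 0 | 1
1 0 0 1 | 1
1 0 1 0 | 1
1 0 1 1 | 1
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 0

[Figure: logic with inputs A, B, C, D and output F]
A B C D | F
0 0 0 0 | 0
0 0 0 1 | 1
0 0 1 0 | 0
0 0 1 1 | 0
0 1 0 0 | 0
0 1 0 1 | 1
0 1 1 0 | 0
0 1 1 1 | 1
1 0 0 0 | 0
1 0 0 1 | 1
1 0 1 0 | 0
1 0 1 1 | 0
1 1 0 0 | 1
1 1 0 1 | 1
1 1 1 0 | 0
1 1 1 1 | 0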
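The "truth table in memory" idea can be sketched in a few lines of software. This is a minimal model, not vendor code: `make_lut` and `lut_read` are names invented for this example, and the bit ordering (A as the most significant address bit) is an assumption chosen to match the row order of the slide's truth tables.

```python
def make_lut(func, k=4):
    """Build the 2**k configuration bits for a k-input LUT from a Boolean function."""
    bits = []
    for addr in range(2 ** k):
        # Unpack the address into input bits, most significant (A) first.
        inputs = [(addr >> (k - 1 - i)) & 1 for i in range(k)]
        bits.append(int(bool(func(*inputs))))
    return bits

def lut_read(bits, *inputs):
    """Evaluate the LUT: the inputs form a memory address, the stored bit is the output."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | (bit & 1)
    return bits[addr]

# The first truth table on the slide: F is 0 only when A and B are both 1.
nand_ab = make_lut(lambda a, b, c, d: not (a and b))
print(nand_ab)
print(lut_read(nand_ab, 1, 1, 0, 1))
```

Any function of the four inputs costs the same 16 bits, and a k-input LUT needs 2**k bits, which is why the XC3K's 5-input LUT takes 32.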
Another View of LUTs
[Figure: D flip-flops, programmed as part of configuration bitstream, driving the 16 data inputs of a 16 x 1 mux; Inputs drive the mux select; one Output]
Can view LUT as 16:1 mux
  Inputs are mux select
  Config sets mux data inputs
  Logically same as 16x1 memory
Can compact logic if you can route inputs to mux data inputs
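The mux view can be checked against the memory view with a small model. The sketch below builds the 16:1 mux from a tree of 2:1 muxes; the function names are illustrative, and `CONFIG` is the F column of the four-input truth table from the earlier LUT slide, with A assumed to be the most significant select bit.

```python
def mux2(sel, d0, d1):
    """2:1 multiplexer: pass d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0

def lut_as_mux(config_bits, a, b, c, d):
    """Select one of 16 configuration bits with a four-level tree of 2:1 muxes."""
    level = list(config_bits)
    # D is the least significant select bit, so it drives the first mux level.
    for sel in (d, c, b, a):
        level = [mux2(sel, level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# F column of the second truth table, rows ABCD = 0000 through 1111.
CONFIG = [0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0]
print(lut_as_mux(CONFIG, 1, 1, 0, 0))
```

The mux tree returns the same bit as indexing `CONFIG` at address 8A + 4B + 2C + D, which is the "logically same as 16x1 memory" point.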
Look Up Table Additional Functionality
§ Can be configured as:
  ♦Shift register (16 regs)
  ♦Small memory (16 bits)
    • “Distributed RAM”
§ Some other FPGAs use muxes instead of memories to implement the core combinational logic
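The shift-register configuration reuses the LUT's 16 storage cells: data shifts through them, and the four LUT inputs pick which stage to read. The class below is a toy behavioral model under that assumption, with invented names, and does not describe any specific vendor primitive's ports or timing.

```python
class Srl16:
    """Toy model of a 4-input LUT reused as a 16-stage addressable shift register."""

    def __init__(self):
        self.cells = [0] * 16  # the same 16 storage cells that would hold a truth table

    def clock(self, data_in):
        """Shift every cell one position and load data_in into cell 0."""
        self.cells = [data_in & 1] + self.cells[:-1]

    def read(self, addr):
        """The 4-bit LUT input selects which stage to tap (addr + 1 clocks of delay)."""
        return self.cells[addr & 0xF]

srl = Srl16()
for bit in (1, 0, 1, 1):
    srl.clock(bit)
print([srl.read(i) for i in range(4)])
```

Reading at a fixed address gives a fixed delay line of up to 16 clocks without spending any flip-flops.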
Spartan-2 CLB
Spartan-2 has 2 LUTs (4 input each) feeding a 3rd LUT, 2 FFs (with Preset/Reset, Enable, posedge or negedge clocks) and 16 muxes
12 inputs (plus clock), 4 outputs
(COURTESY XILINX, INC.)
Spartan-3
CLBs are composed of 4 slices
  Organized as 2 pairs, one of which is optimized for memory access
Each slice has 2 FFs and 2 LUTs
(COURTESY XILINX, INC.)
FPGA Families extend Architecture
❏Devices are built, with more capability, but around the same basic architecture
❏Some additional capabilities
  ◆Low voltage versions
  ◆Faster clock rates
  ◆Different packaging options
(Courtesy Xilinx, Inc.)
The need for more stuff
❏CompEs cannot design on logic, routing, I/O alone
❏Extreme case from early 90s
  ◆16 port ATM switch, designed on a single board
  ◆Design is limited by I/O to memory chips--bring them on-chip
[Figure labels: FPGAs (XC3Ks), FIFO memory chips]
Other “Stuff”
❏Clock managers
  ◆Global clock buffering, distribution
  ◆DCM: eliminate skew, phase shifts, multiply or divide clock
❏Memory
  ◆Block RAM
  ◆Distributed RAM (repurposed LUTs)
❏Shift Registers
❏Dedicated Multiplexers
❏Carry Look-Ahead Generators
❏I/O Blocks
  ◆SelectIO supports 18 standards (single, differential, various voltage levels, ....)
❏Embedded Multipliers
Block RAMs
§ Distributed RAM
  ♦Use LUTs as memories
  ♦Low density
  ♦Poor performance
§ Block RAM
  ♦Large-ish dedicated memory blocks
    •Xilinx BRAMs = 18Kb
  ♦Some configurability
    •Dual-port
    •Data width / depth
    •FIFO, CAM, etc.
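The width / depth trade-off is simple bookkeeping: a fixed number of bits can be arranged as few wide words or many narrow ones. The sketch below assumes all 18Kb of a block are usable at any width that divides it evenly, which is a simplification; real devices support only specific width and depth pairs, and the helper names are made up for this example.

```python
BRAM_BITS = 18 * 1024  # 18Kb block, per the slide

def bram_depth(width):
    """Words available at a given data width, assuming all 18Kb are usable."""
    if BRAM_BITS % width:
        raise ValueError("width must divide the block size evenly")
    return BRAM_BITS // width

def brams_needed(width, depth, block_width=18):
    """Blocks to tile a width x depth memory from identically configured BRAMs."""
    per_block_depth = bram_depth(block_width)
    columns = -(-width // block_width)      # ceiling division
    rows = -(-depth // per_block_depth)
    return columns * rows

print(bram_depth(18), bram_depth(36))
print(brams_needed(32, 4096))
```

Under these assumptions a 32-bit by 4096-word buffer built from 18-bit-wide blocks takes two columns of four blocks each.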
Multipliers
18x18 signed 2’s-complement multiplier
§ Two 18b inputs
§ One 36b output
§ 18b enough for many DSP applications
§ Can gang multiple units together for wider data
§ Faster and lower power than multiplier from CLBs
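One way to picture ganging multipliers for wider data is the schoolbook split into partial products. The sketch below is an illustrative decomposition, not the vendor's recommended structure: it builds an unsigned 34x34 multiply from four signed 18x18 blocks by using 17-bit halves, which are always non-negative and therefore fit the signed 18-bit inputs. Function names are invented for this example.

```python
def mult18x18(a, b):
    """Model of one signed 18x18 multiplier; checks the 18-bit two's-complement range."""
    lo, hi = -(1 << 17), (1 << 17) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operand does not fit in 18 signed bits")
    return a * b  # always fits in 36 signed bits

def mult34x34_unsigned(x, y):
    """Unsigned 34x34 multiply assembled from four 18x18 blocks.

    Each operand is split into 17-bit halves, which are non-negative and
    therefore valid 18-bit signed inputs.
    """
    mask = (1 << 17) - 1
    x_hi, x_lo = x >> 17, x & mask
    y_hi, y_lo = y >> 17, y & mask
    return ((mult18x18(x_hi, y_hi) << 34)
            + ((mult18x18(x_hi, y_lo) + mult18x18(x_lo, y_hi)) << 17)
            + mult18x18(x_lo, y_lo))

x, y = (1 << 34) - 1, 0x2ABCD1234
print(mult34x34_unsigned(x, y) == x * y)
```

The shifts and additions that combine the partial products would still come from the fabric in this arrangement; only the four multiplies map onto the dedicated blocks.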
CPU Connectivity: PLB and OPB
IBM CoreConnect
§ Processor Local Bus (PLB) - fast on-chip communication
§ On-Chip Peripheral Bus (OPB) - optimized for peripherals (UART, etc.)
§ Device Control Register bus (DCR) - used to send and set configuration
CPU Connectivity: OCM
On-Chip Memory controller
§ CPU ↔ block RAM
§ 2 OCMs – I and D
§ Direct, fast interface
§ Can use dual-port BRAMs for producer-consumer link to FPGA fabric
CPU Links
A lot more details on the embedded CPU
§ http://www.xilinx.com/bvdocs/userguides/ppc_ref_guide.pdf
§ http://direct.xilinx.com/bvdocs/userguides/ug018.pdf
§ http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/CoreConnect_Bus_Architecture
Zynq 7000
Advanced Microcontroller Bus Architecture (AMBA) + Advanced eXtensible Interface (AXI)
To memory, FPGA fabric, I/O & Peripherals
AMBA = ARM’s attempt at The One True Interface
Configuration Storage
Lots of configuration bits
  LUTs, routing, I/O configuration
  Xilinx XC2VP30 has >11Mb
Configuration storage technologies
  Volatile
    SRAM cells
  Non-volatile
    FLASH, EEPROM
  Anti-fuse
    Actel anti-fuse
[Figure: 6T SRAM cell with word line WL and bit lines bit and bit_b]
Configuration
How to load (scan) configuration bits (bitstream)
  Connect all configuration registers into single long shift register
  Serially clock in configuration bits
  Most designs use standard scan interface (JTAG) developed for test
Bitstream source
  Non-volatile memory
    On-board FLASH, EEPROM, serial memory
    External media (CF card)
  Attached workstation
Can encrypt bitstream to conceal configuration
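The scan-style loading described above can be modeled as one long shift register that is clocked once per bitstream bit. This is a toy sketch with invented names; it ignores JTAG protocol details, framing, and error checking that a real configuration interface would have.

```python
class ConfigChain:
    """Toy model: every configuration bit lives in one long shift register."""

    def __init__(self, length):
        self.bits = [0] * length

    def shift_in(self, bit):
        """One configuration clock: push a bit in at the head; the tail bit falls out."""
        out = self.bits[-1]
        self.bits = [bit & 1] + self.bits[:-1]
        return out

def load_bitstream(chain, bitstream):
    """Serially clock a full bitstream into the chain."""
    if len(bitstream) != len(chain.bits):
        raise ValueError("bitstream length must match the chain length")
    for bit in bitstream:
        chain.shift_in(bit)

chain = ConfigChain(8)
load_bitstream(chain, [1, 0, 1, 1, 0, 0, 1, 0])
print(chain.bits)
```

The first bit shifted in ends up at the far end of the chain, so in this model the bitstream order is the reverse of the register order. Loading takes one clock per configuration bit, which is why a device with more than 11Mb of configuration takes a noticeable time to program over a serial link.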