hoplite-dsp - FPL 2016 · Hoplite-DSP Harnessing the Xilinx DSP48 Multiplexers to efficiently...
Hoplite-DSP: Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs
Chethan Kumar H B and Nachiket Kapre [email protected]
Hoplite — FPL 2015 paper
• Jan Gray co-author
• Specs: 60 LUTs + 100 FFs, 2.9ns clock
• Smallest FPGA router available + RTL code
Router                LUTs   FFs   Clock
Penn                  1.7K   541   4.5ns
CMU                   1.5K   635   9.6ns
Hoplite (FPL 2015)    60     100   2.9ns
Hoplite improvement   25x    5x    1.5x
(32b payload, Virtex-6 240T)
Router                    LUTs   FFs   Clock
Hoplite (FPL 2015)        70     140   2.7ns
Hoplite-DSP (FPL 2016)    13     17    2.8ns
Hoplite-DSP improvement   5x     8x    ~
(47b payload, Virtex-7 485T; Hoplite-DSP cost: + 1 DSP48)
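The ratio annotations on the two comparison slides can be reproduced from the table entries. The short Python check below is only a sanity sketch: it reads "1.7K" and "1.5K" as 1700 and 1500 (an approximation), and the dictionary and function names are made up for illustration.

```python
# Per-router costs as quoted on the slides: (LUTs, FFs, clock period in ns).
# "1.7K" and "1.5K" are read as 1700 and 1500 here, which is an approximation.
virtex6 = {
    "Penn": (1700, 541, 4.5),
    "CMU": (1500, 635, 9.6),
    "Hoplite": (60, 100, 2.9),
}
virtex7 = {
    "Hoplite": (70, 140, 2.7),
    "Hoplite-DSP": (13, 17, 2.8),
}


def ratios(baseline, candidate):
    """Element-wise baseline/candidate ratios (above 1 favours the candidate)."""
    return tuple(b / c for b, c in zip(baseline, candidate))


for name in ("Penn", "CMU"):
    lut, ff, clk = ratios(virtex6[name], virtex6["Hoplite"])
    print(f"{name} vs Hoplite: {lut:.1f}x LUTs, {ff:.1f}x FFs, {clk:.2f}x clock")

lut, ff, clk = ratios(virtex7["Hoplite"], virtex7["Hoplite-DSP"])
print(f"Hoplite vs Hoplite-DSP: {lut:.1f}x LUTs, {ff:.1f}x FFs, {clk:.2f}x clock")
```

Under these assumptions, the smaller of the two Virtex-6 ratios in each column lines up with the 25x, 5x and 1.5x annotations, and the Virtex-7 ratios round to the quoted 5x and 8x with an essentially unchanged clock.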
Motivation
• Close the gap vs. embedded NoCs — do we really want clean-slate hard NoCs?
• Return resources to FPGA application — reduce NoC overheads
• Find clever ways to reuse existing FPGA elements
Outline
• Adapting the Hoplite arch. to the DSP48
• Scaling to 2D layouts — using DSP carry chains
• Performance and Resource evaluation
Overview of Hoplite switch organization
• NoC organised as a unidirectional torus
• Each switch has 2 inputs, 2 outputs into the network + PE connection
• Uses deflection routing — no buffering, no allocation, etc
from: Jan Gray
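The switch behaviour listed above can be sketched as a small per-cycle routing function. This is a simplified illustration, not the authors' RTL: the `Packet` type, the `route` function, the priority order (North, then West, then PE) and the exit test on the shared S/PE port are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Tuple


class Packet(NamedTuple):
    dst_x: int
    dst_y: int
    payload: int


def route(
    x: int,
    y: int,
    west: Optional[Packet],
    north: Optional[Packet],
    pe: Optional[Packet],
) -> Tuple[Optional[Packet], Optional[Packet], bool, bool]:
    """One cycle of a bufferless switch at (x, y) on a unidirectional torus.

    Returns (east_out, south_out, pe_accepted, exits_here).
    """
    east: Optional[Packet] = None
    south: Optional[Packet] = None

    # Assumed priority 1: traffic from the north always continues south.
    if north is not None:
        south = north

    # Assumed priority 2: west traffic turns south in its destination column;
    # if the south port is taken it is deflected east instead of being buffered.
    if west is not None:
        if west.dst_x == x and south is None:
            south = west
        else:
            east = west

    # Assumed priority 3: the PE injects only if its wanted port is still free.
    pe_accepted = False
    if pe is not None:
        if pe.dst_x == x and south is None:
            south, pe_accepted = pe, True
        elif pe.dst_x != x and east is None:
            east, pe_accepted = pe, True

    # Assumed exit rule: the S/PE port is shared, so a packet on it that has
    # reached its destination switch is handed to the PE.
    exits_here = south is not None and (south.dst_x, south.dst_y) == (x, y)
    return east, south, pe_accepted, exits_here
```

For example, if a north packet and a west packet both want the south port of the same switch, the west packet comes out on the east port and travels around the ring for another attempt.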
Hoplite Internals
[Figure: Hoplite switch datapath. Labels: W, N and PE inputs; E and S/PE outputs; multiplexers built from 5-LUT and 6-LUT blocks; DOR logic driving sel0 and sel1,2.]
Hoplite summary
• Bulk of the footprint comes from 5-LUT, 6-LUT blocks, which implement the packet multiplexers
• DOR logic is a handful of LUTs: only reads address fields, valid signals
• Inter-Hoplite router links are pipelined with registers
• Idea: move (1) multiplexers + (2) registers into Xilinx DSP48 block
Xilinx DSP48 block
[Figure: DSP48 block diagram. Labels: inputs A (30b), D (27b), B (18b), C (48b); cascade input PCIN (48b); X, Y and Z multiplexers feeding the ALU; output P (48b) and cascade output PCOUT; control inputs OPMODE, ALUMODE, INMODE.]
Programmable elements
• Xilinx DSP block very versatile!
• Typical use case: signal processing, streaming computations => mainly arithmetic
• INMODE: 27b multiplexer between A and D
• OPMODE: 48b multiplexers between A:B, C
• Exploit cascade links PCIN/PCOUT!
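To make the "DSP block as a multiplexer" idea concrete, here is a minimal behavioural sketch in Python. It assumes the ALU is left in plain add mode, so that selecting a single non-zero operand turns the adder into a 48b mux. The OPMODE bit patterns follow my reading of the 7-series DSP48E1 X/Y/Z field layout and should be treated as assumptions to check against the vendor documentation; multiplier selections are deliberately left out, and `dsp48_select` is a hypothetical name.

```python
MASK48 = (1 << 48) - 1


def dsp48_select(opmode: int, ab: int, c: int, pcin: int, p_prev: int = 0) -> int:
    """Behavioural model of the X/Y/Z operand muxes followed by an adder.

    Only the non-multiplier selections are modelled; other encodings raise KeyError.
    """
    x_sel = opmode & 0b11          # assumed OPMODE[1:0]
    y_sel = (opmode >> 2) & 0b11   # assumed OPMODE[3:2]
    z_sel = (opmode >> 4) & 0b111  # assumed OPMODE[6:4]

    x = {0b00: 0, 0b10: p_prev, 0b11: ab}[x_sel]
    y = {0b00: 0, 0b10: MASK48, 0b11: c}[y_sel]
    z = {0b000: 0, 0b001: pcin, 0b010: p_prev, 0b011: c}[z_sel]

    # ALU assumed to compute Z + X + Y, truncated to 48 bits.
    return (x + y + z) & MASK48


# With two operands forced to zero, the adder passes the third one through.
SELECT_AB = 0b000_00_11    # X = A:B
SELECT_C = 0b000_11_00     # Y = C
SELECT_PCIN = 0b001_00_00  # Z = PCIN (cascade input)

if __name__ == "__main__":
    for name, mode in [("A:B", SELECT_AB), ("C", SELECT_C), ("PCIN", SELECT_PCIN)]:
        out = dsp48_select(mode, ab=0xAAAA, c=0xCCCC, pcin=0x1111)
        print(f"{name:>4}: {out:#x}")
```

Because OPMODE is a dynamic input, the selection can change every clock cycle, which is what lets routing logic in the fabric steer packets through the block.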
Input + Multiplexer Mapping
[Figure: the Hoplite multiplexers mapped onto the DSP48 block, with the WEST, PE and N inputs and the S/PE and EAST outputs assigned to DSP48 ports.]
Multi-cycling
• Problem: Hoplite has two outputs (three in fact, with S/PE output port shared)
• Solution: must multi-pump the DSP block — runs at 2x the frequency of the PEs
• First sub-cycle — resolve EAST output
• Second sub-cycle — resolve SOUTH/PE output
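The two sub-cycles can be sketched as one shared multiplexer that is evaluated twice per PE clock. This is a behavioural illustration only: the function names, the dictionary of named inputs and the way selects are passed in are assumptions, and which DSP48 port carries which input is not modelled.

```python
def mux(select, inputs):
    """The single shared multiplexer: return the named input, or None when idle."""
    return inputs.get(select)


def pe_cycle(inputs, east_select, south_select):
    """One PE-clock cycle, modelled as two DSP sub-cycles through the same mux."""
    # Sub-cycle 0: resolve the EAST output from the PE and West inputs.
    east = mux(east_select, {k: inputs[k] for k in ("pe", "west")})
    # Sub-cycle 1: resolve the SOUTH/PE output from the PE, West and North inputs.
    south = mux(south_select, {k: inputs[k] for k in ("pe", "west", "north")})
    return east, south


def pe_period_ns(dsp_period_ns):
    """With 2x multi-pumping, one PE cycle spans two DSP clock periods."""
    return 2 * dsp_period_ns


if __name__ == "__main__":
    ins = {"pe": "pkt_pe", "west": "pkt_w", "north": "pkt_n"}
    print(pe_cycle(ins, east_select="west", south_select="north"))
```

The cost of sharing one mux is visible in `pe_period_ns`: the PE-side clock runs at half the DSP clock rate.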
First cycle
[Figure: DSP48 configuration in the first sub-cycle. Labels: PE Input, West Input, East Output, P, CE.]
Second cycle
[Figure: DSP48 configuration in the second sub-cycle. Labels: PE Input, West Input, North Input, South/PE Output, P, CE.]
DSP48 columnar layout
[Figure: DSP48E blocks stacked in a DSP column. Labels: dedicated cascade routes (PCOUT to PCIN) between neighbouring blocks; A:B, C and P connected through the programmable FPGA interconnect to user logic and DOR logic.]
Layout considerations
• FPGA DSPs organised into vertical columns: ~100s of DSPs in a column, ~10s of columns
• Restrictions:
  1. Cascade links only extend within a column
  2. Horizontal links must use general interconnect
• Key question: adjusting NoC size vs. DSP count; use passthrough DSPs
Embedded layout
[Figure: Hoplite routers embedded in DSP48E columns, linked by cascade and fabric wiring. Labels: Router DSPs; Pass-thru DSPs (PCOUT to PCIN); Top-Turn DSPs (PCIN to P); Bottom-Turn DSPs (A:B to PCOUT).]
Comparing Xilinx Virtex6 and Virtex7 Layouts
8x8 NoC (ML605 board)
16x16 NoC (VC707 board)
LUTs vs DSPs
• Simple tradeoff
  - substantially fewer LUTs vs. DSP48s
  - importantly, FFs absorbed into DSP48
• Power and effective B/W for random traffic mostly identical
Commentary on hard NoCs
• Area:
  - Hard router = 12.45 LABs
  - 1 Altera DSP block = 11.9 LABs (Stratix-III)
  - Hoplite-DSP marginally smaller
• Speed:
  - Hard router ~996 MHz
  - Hoplite-DSP ~650 MHz (multi-pumped)
  - Hoplite-DSP limits freq advantage to 3x
• Power:
  - Hard router ~1.58 W
  - Hoplite-DSP model ~1.1 W at 15% activity
  - Hoplite-DSP uses ~50% less power
Abdelfattah + Betz [TRETS2014] (extrapolated results for 48b-wide 1VC)
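The area and speed comparisons can be checked with a few lines of arithmetic. The sketch below is one plausible reading of the slide, not a result from the paper: it assumes the ~650 MHz figure is the DSP clock and that 2x multi-pumping halves the network-visible rate. Variable names are illustrative, and the power figures are left as quoted.

```python
# Figures as quoted on the slide.
hard_router_labs = 12.45
dsp_block_labs = 11.9
hard_router_mhz = 996.0
hoplite_dsp_mhz = 650.0

# Assumption: two DSP sub-cycles per network cycle (2x multi-pumping).
pump_factor = 2

area_ratio = dsp_block_labs / hard_router_labs  # below 1 means the DSP block is smaller
network_mhz = hoplite_dsp_mhz / pump_factor     # rate seen by the PEs
speed_gap = hard_router_mhz / network_mhz       # hard-router frequency advantage

print(f"DSP block area / hard router area: {area_ratio:.3f}")
print(f"Network-visible Hoplite-DSP rate:  {network_mhz:.0f} MHz")
print(f"Hard router frequency advantage:   {speed_gap:.2f}x")
```

Under that reading, the DSP block is a few percent smaller than the hard router, and the hard router's clock advantage comes out at roughly 3x.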
Wish-list for DSP48s Gen2
• Configurable Cascades
  - 48b switched bidirectional routing instead of just cascades (approach hard NoC wiring)
  - option to skip DSP blocks (segment lengths)
• DOR routing
  - pattern detection logic with multiple masks (similar to Altera DSP units)
• SIMD Multiplexing: fracturing 48b-wide lanes into multiple lanes
Conclusions
• Hoplite muxes mapped to DSP48 blocks — use the dynamic OPMODE feature
• Reduce cost by 5x LUTs, 8x FFs per router
• Exploit cascade links to absorb NoC wiring
• Significantly close the gap with hard NoCs
Embedded layout
• Three kinds of DSPs
• “Route DSPs”: small fraction of DSPs for switching
• “Pass-through DSPs”: glorified “pipelined wires”; multi-pumping gives 50% back to user
• “Corner-turn DSPs”: connect cascades to fabric
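The three DSP roles can be illustrated with a toy planner that labels every DSP in one column. This is a hypothetical sketch: the even spacing of routers and the use of exactly one turn DSP at each end of the column are assumptions made for the example, and `plan_column` is not a function from the authors' tool flow.

```python
def plan_column(dsps_in_column, routers):
    """Assign a role to each DSP in one column, listed in cascade order."""
    # Assumption: one bottom-turn and one top-turn DSP bracket the column.
    usable = dsps_in_column - 2
    if routers < 1 or usable < routers:
        raise ValueError("column too short for this many routers")

    # Every DSP that is not a router just forwards the cascade (a pipelined wire).
    roles = ["pass-thru"] * usable
    for i in range(routers):
        # Spread the routers evenly along the cascade chain.
        roles[i * usable // routers] = "router"

    return ["bottom-turn"] + roles + ["top-turn"]


if __name__ == "__main__":
    for index, role in enumerate(plan_column(dsps_in_column=10, routers=4)):
        print(index, role)
```

Shrinking `routers` for a fixed column length simply converts more DSPs into pass-through blocks, which is the knob for matching NoC size to the available DSP count.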
Physical FPGA layout
[Figure: physical layout of a 2x2 NoC (ML605 board). Legend: Corner-Turn, Pass-Thru, Hoplite.]
Efficiency
DSP48s less efficient than LUT-based Hoplite!