© 2004 Altera Corporation
Optimizing a High Performance 32-bit Processor for Programmable Logic
Paul Metzgen, 16th November 2004
2 © 2004 Altera Confidential ®
Agenda
System Design on FPGAs
– Brief Overview of Altera’s SOPC Tools
Architecting Designs for FPGAs
– Different Design Trade-offs
Case Study: The Design of Nios II
– Implementing Multiplexers in FPGAs
– Optimizing Multiplexers in Nios II
System Design on FPGAs
Overview of Altera’s SOPC Toolflow
Altera’s SOPC Builder
– Peripheral Set: can also add your own (e.g. custom peripherals, accelerators)
– Can specify system connectivity
[Figure labels: RAM, PIO, I-master, D-master]
– Automatic Logic & Bus Generation
– Automatic Device Driver Generation
Nios II IDE
[Screenshot labels: Terminal window, File Viewer, Window]
SOPC Toolflow: Summary
Nios II Family of Processors:
                  Fast      Standard   Economy
Pipeline          6-stage   5-stage    5-cycle
Br. Prediction    Dynamic   Static     no
I$ - Cache        yes       yes        no
D$ - Cache        yes       no         no
Performance       7.5x      4.7x       1.0x
Size (LEs)        1800      1400       700
Processor Cost vs. Performance
[Figure: Performance (DMIPS, 0 to 300) against Cost of CPU Logic ($0.00 to $5.00) for the e, s and f cores on Stratix, Cyclone, Stratix II and HardCopy® Stratix II]
Architecting Designs for FPGAs
Different Design Trade-offs
Making the most of the Available Resources
Logic ‘Elements’ (LUT + REG)
DSP Blocks
[Figure: DSP block with input register unit, multipliers, optional pipelining, add/subtract/accumulate stage, output multiplexer and output register unit]
Memories
– More Bits For Larger Memory Buffering
– More Data Ports for Greater Memory Bandwidth
[Figure callouts: 142 GMac/s, x180,000, 5.1 Tbyte/s]
Relative Area Costs
                     ASIC     FPGA
Registers            Medium   Low      Free Register with every Lookup Table (independently accessible)
Adders (+)           Medium   Low
Multipliers (*)      High     Medium   ‘Hard’ Optimized ASIC Blocks
Memory (D$)          High     Medium   ‘Hard’ Optimized ASIC Blocks
Multiplexers (4:1)   Low      High     Implemented in Lookup Tables
“The Key to Optimizing Designs for an FPGA …is to Optimize the Multiplexers”
Architecting Designs for FPGAs
Barrel-Shifts using Multipliers
A Barrel-Shifter Using Multiplexers
[Figure: 8-bit example with inputs A to H and outputs Z0 to Z7]
N log2 N LEs
160 LEs for a 32-bit Barrel Shifter
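The N log2 N figure can be checked with a short calculation. The following is a minimal Python sketch, assuming a logarithmic shifter built from log2 N stages of N two-input multiplexers at one LE each; the function name is illustrative and does not come from the talk.

```python
def barrel_shifter_les(width: int) -> int:
    """Estimate LEs for a multiplexer-based barrel shifter of the given width.

    Assumes one stage per bit of shift amount (log2(width) stages) and one
    2:1 multiplexer, i.e. one LE, per data bit per stage.
    """
    if width < 2 or width & (width - 1):
        raise ValueError("width must be a power of two, at least 2")
    stages = width.bit_length() - 1  # exact log2 for a power of two
    return width * stages


if __name__ == "__main__":
    print(barrel_shifter_les(32))  # 32 * 5 = 160, matching the slide
```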
Barrel Shifter Using Multipliers
[Figure: the input word W X Y Z is multiplied (*) by a one-hot constant such as 0000000100000, selected by the shift amount N; the product contains W X Y Z shifted into position]
Area Cost (ASIC / FPGA): Multipliers (*) High / Medium; Multiplexers (4:1) Low / High
Shifters using Multipliers
[Figure, built up over five slides: one multiplier produces all of the shift results]
– SHL (N): low half of the product (MULLOW)
– ASR (32-N): high half of the product (MULHIGH), signed multiply
– SHR (32-N): high half of the product (MULHIGH), unsigned multiply
– ROT (N): low and high halves of the unsigned product combined
– A 3:1 multiplexer selects the result
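The multiplier-based shifts above can be modelled in software. This is a minimal Python sketch, assuming a 32 x 32 multiplier with a 64-bit product and a one-hot multiplicand 2^N; the helper names (`mul_halves`, `shl`, `shr`, `asr`, `rot`) are illustrative and not taken from the talk, and the model ignores how the hardware represents the constant 2^31 in signed mode.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mul_halves(value: int, n: int, signed: bool) -> tuple[int, int]:
    """Multiply value by 2**n and return (MULHIGH, MULLOW) of the 64-bit product."""
    operand = to_signed32(value) if signed else value & MASK32
    product = (operand * (1 << n)) & MASK64
    return (product >> 32) & MASK32, product & MASK32


def shl(value: int, n: int) -> int:
    """Shift left by n (0..31): low half of the product."""
    return mul_halves(value, n, signed=False)[1]


def shr(value: int, k: int) -> int:
    """Logical shift right by k (1..32): high half of value * 2**(32-k)."""
    return mul_halves(value, 32 - k, signed=False)[0]


def asr(value: int, k: int) -> int:
    """Arithmetic shift right by k (1..32): high half of a signed multiply."""
    return mul_halves(value, 32 - k, signed=True)[0]


def rot(value: int, n: int) -> int:
    """Rotate left by n (0..31): OR of the high and low halves."""
    high, low = mul_halves(value, n, signed=False)
    return high | low


if __name__ == "__main__":
    x = 0x80000001
    print(hex(shl(x, 4)), hex(shr(x, 4)), hex(asr(x, 4)), hex(rot(x, 4)))
```

A right shift by k is obtained by multiplying by 2^(32-k) and taking the high half, which is why the slides label the right shifts with (32-N).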
Case Study: The Design of Nios II
The ALU
Case Study: The NIOS II Pipeline
[Figure, built up over four slides: pipeline datapath. Labels: I$, RFa, RFb, ALU, multiplier (*), D$, Instruction Immediate, External Memory Read, Alu Result, Data Cache Read; multiplexers: 2:1, 3:1, 3:1, 2:1]
Multiplier is used for Barrel-Shifts as well as Multiplication
The Logic Unit
[Figure: pipeline datapath with the logic unit added to the ALU; new labels: +/-, logic, 2:1, 4:1, 4 LUT]
The Arithmetic Unit
[Figure: pipeline datapath with the +/- unit highlighted]
The Comparator Unit
[Figure: pipeline datapath with a comparator (>/=) added and a 3:1 result multiplexer]
CMP.op r3, r2, r1
IF (r2 op r1) THEN R3 = 0x00000001 ELSE R3 = 0x00000000
Nios II has no explicit Flags
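The compare semantics can be written out directly. This is a minimal Python sketch of the rule above; the operation names in `COMPARE_OPS` are illustrative stand-ins for the `op` field and are not claimed to be the full Nios II compare set.

```python
import operator

# Illustrative subset of comparison operations for the "op" field.
COMPARE_OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "ge": operator.ge,
}


def cmp_op(op: str, r2: int, r1: int) -> int:
    """Return the 32-bit value written to r3: 1 if (r2 op r1) holds, else 0."""
    return 0x00000001 if COMPARE_OPS[op](r2, r1) else 0x00000000


if __name__ == "__main__":
    print(cmp_op("lt", 3, 7), cmp_op("eq", 3, 7))
```

Because the result lands in a general register rather than in flags, a following branch can simply test that register against zero.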
Return Address Save
[Figure: pipeline datapath with a Return Address input and a 4:1 result multiplexer]
Instructions that save Return Address: CALL, TRAP, INTERRUPT, BREAK
Return Address is saved in a Link Register
Case Study: The Design of Nios II
Increasing the Clock Rate
The NIOS II Pipeline
[Figure: pipeline datapath with multiplexers 2:1, 3:1, 3:1, 4:1, 2:1]
Pipeline to achieve a high Clock Rate (fmax)
Forwarding Logic
[Figure: pipeline datapath with new 5:1 and 6:1 multiplexers]
ADD R2, R1, R0
MUL R4, R3, R2
Forwarding needed to update out-of-date values in the pipeline
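The ADD/MUL pair shows why forwarding is needed: MUL reads R2 before ADD has written it back. Below is a minimal Python sketch of an operand-select step, assuming a simple model in which results still in the pipeline are kept as (destination, value) pairs, youngest first; the function and variable names are hypothetical and the real forwarding multiplexers have more inputs.

```python
def read_operand(reg: str, regfile: dict, in_flight: list) -> int:
    """Pick the newest value of reg: an in-flight result if one exists,
    otherwise the (possibly stale) register file entry."""
    for dest, value in in_flight:  # youngest result first
        if dest == reg:
            return value
    return regfile[reg]


if __name__ == "__main__":
    regfile = {"R0": 2, "R1": 3, "R2": 99, "R3": 4}  # R2 holds a stale value

    # ADD R2, R1, R0 has executed but not yet written back.
    in_flight = [("R2", regfile["R1"] + regfile["R0"])]

    # MUL R4, R3, R2 with and without forwarding.
    forwarded = read_operand("R3", regfile, in_flight) * read_operand("R2", regfile, in_flight)
    stale = regfile["R3"] * regfile["R2"]
    print(forwarded, stale)
```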
Case Study: The Design of Nios II
The Cost of Multiplexers
NIOS II Multiplexers
[Figure: pipeline datapath with its multiplexers highlighted: 5:1, 6:1, 2:1, 3:1, 3:1, 4:1]
What is the Cost of a Multiplexer…?
[Figure: 2:1, 3:1, 4:1, 5:1 and 6:1 multiplexers built as Binary (2:1) trees]
Binary (2:1): Natural Implementation Choice for an ASIC
NIOS II Multiplexers
[Figure: pipeline datapath annotated with the LE cost per bit of each multiplexer as a binary (2:1) tree: 5:1 = 4, 6:1 = 5, 2:1 = 1, 3:1 = 2 (twice), 4:1 = 3]
544 LEs (17 x 32 bits)
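The 544 LE total follows from counting 2:1 multiplexers. This is a minimal Python sketch, assuming each N:1 multiplexer is a tree of N-1 two-input multiplexers at one LE per bit, with the list of multiplexer sizes read off the pipeline figure; the names are illustrative.

```python
DATA_WIDTH = 32

# Multiplexer sizes labelled in the pipeline figure.
NIOS_MUX_INPUTS = [5, 6, 2, 3, 3, 4]


def binary_tree_cost(inputs: int) -> int:
    """LEs per bit for an N:1 multiplexer built from 2:1 multiplexers."""
    return inputs - 1


les_per_bit = sum(binary_tree_cost(n) for n in NIOS_MUX_INPUTS)
total_les = les_per_bit * DATA_WIDTH

if __name__ == "__main__":
    print(les_per_bit, total_les)  # 17 and 544
```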
NIOS II Multiplexers
[Figure: pipeline datapath comparing the multiplexers with the functional units]
Multiplexers: 544 LEs (17 x 32 bits)
+/-, logic, >/=: 78 LEs
Multiplexer Cost is Dominant
Area Usage in 100 Customer Designs
Muxes                  26%
Arithmetic (+, <, =)   11%
Wide-AND               11%
Wide-XOR                3%
Lonely-Reg             18%
Other                  31%
Many Designs contain lots of Multiplexers!
Multiplexers in FPGA
Low-Cost Multiplexers
Efficient 4:1 Mux on Stratix
[Figure: 4:1 multiplexer with data inputs A, B, C, D and select inputs S0, S1]
Uses just 2 LEs.
Efficient 4:1 Mux on Stratix: How it works
[Figure: two lookup tables, one handling C/D and one handling A/B, shown for each select combination]
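One way to fit a 4:1 multiplexer into two 4-input lookup tables is sketched below in Python. The exact LUT contents on the slide are not recoverable from this transcript, so the split is an assumption: the first LUT either picks C or D or passes S0 through, and the second LUT either picks A or B or passes the first LUT's output through. Function names are illustrative.

```python
from itertools import product


def lut1(s0: int, s1: int, c: int, d: int) -> int:
    """First 4-input LUT: when S1=1 select C or D by S0, otherwise pass S0 on."""
    if s1:
        return d if s0 else c
    return s0


def lut2(a: int, b: int, s1: int, x: int) -> int:
    """Second 4-input LUT: when S1=1 pass x through, otherwise select A or B by x."""
    if s1:
        return x
    return b if x else a


def mux4_two_luts(a, b, c, d, s0, s1):
    """4:1 multiplexer using exactly two 4-input LUTs (six signals in total)."""
    return lut2(a, b, s1, lut1(s0, s1, c, d))


def mux4_reference(a, b, c, d, s0, s1):
    """Plain 4:1 multiplexer: select index is S1:S0."""
    return (a, b, c, d)[s0 + 2 * s1]


if __name__ == "__main__":
    ok = all(
        mux4_two_luts(*bits) == mux4_reference(*bits)
        for bits in product((0, 1), repeat=6)
    )
    print("two-LUT 4:1 mux matches reference:", ok)
```

The point of the trick is that neither LUT needs more than four inputs, even though the multiplexer as a whole has six.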
The Improved Cost of Binary Multiplexers
[Figure: larger multiplexers built from 4:1 multiplexers and a Selector]
LEs per bit, Binary (4:1):
2:1   3:1   4:1   5:1   6:1
 1     2     2     3     4
Efficient Multiplexers
[Figure: pipeline datapath annotated with the new LE cost per bit: 5:1 = 3, 6:1 = 4, 2:1 = 1, 3:1 = 2 (twice), 4:1 = 2]
448 LEs (14 x 32 bits), down from 544 LEs (17 x 32 bits)
-18%
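The 448 LE figure can be reproduced from a per-size cost table. This is a minimal Python sketch, assuming the per-bit costs for 4:1-based multiplexers as read from the slides (2:1 = 1, 3:1 = 2, 4:1 = 2, 5:1 = 3, 6:1 = 4) and the 544 LE binary-tree baseline quoted above; the names are illustrative.

```python
DATA_WIDTH = 32
BASELINE_LES = 17 * DATA_WIDTH  # binary (2:1) trees, from the earlier slide

# LEs per bit when a 4:1 multiplexer fits in 2 LEs (values read from the slides).
COST_WITH_4TO1 = {2: 1, 3: 2, 4: 2, 5: 3, 6: 4}

# Multiplexer sizes labelled in the pipeline figure.
NIOS_MUX_INPUTS = [5, 6, 2, 3, 3, 4]

les_per_bit = sum(COST_WITH_4TO1[n] for n in NIOS_MUX_INPUTS)
total_les = les_per_bit * DATA_WIDTH
reduction_percent = 100 * (BASELINE_LES - total_les) / BASELINE_LES

if __name__ == "__main__":
    print(les_per_bit, total_les, f"-{reduction_percent:.0f}%")
```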
Multiplexers in FPGA
Registered Multiplexers
Efficient Multiplexers
[Figure: pipeline datapath; the cost per bit of the 6:1 multiplexer drops from 4 to 3]
416 LEs (13 x 32 bits), down from 544 LEs (17 x 32 bits)
-24%
Multiplexer costs can be reduced using a register!
The Stratix LE
[Figure: Stratix logic element]
Additional LAB-wide signals (shared between 8 LEs): enable, sload, sclear
2:1 Mux in 1 LE
[Figure: inputs d0, d1 and sel]
3:1 Mux in 1 LE
[Figure: inputs d0, d1, d2 and sel; d2 enters through the sync-load path]
Register Needed (for sload)
4:1 Mux in 1 LE
[Figure: inputs d0, d1, d2 and sel, with sload and sclear; the fourth input is the constant 0]
Register Needed (for sload / sclear)
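The registered multiplexers above rely on the register's synchronous load and clear inputs acting as extra multiplexer stages. Below is a minimal Python sketch of that behaviour for a single bit. It assumes sclear takes priority over sload, which is not stated in the slides, and all function names are illustrative.

```python
def lut_mux2(d0: int, d1: int, sel: int) -> int:
    """2:1 multiplexer implemented in the LE's lookup table."""
    return d1 if sel else d0


def register_next(lut_out: int, sload: int, sdata: int, sclear: int) -> int:
    """Value the LE register captures on the next clock edge.

    Assumption: sclear overrides sload, which overrides the LUT output.
    """
    if sclear:
        return 0
    if sload:
        return sdata
    return lut_out


def registered_mux3(d0, d1, d2, sel, sload):
    """3:1 registered multiplexer in one LE: d2 arrives via sync-load."""
    return register_next(lut_mux2(d0, d1, sel), sload, d2, 0)


def registered_mux4_with_zero(d0, d1, d2, sel, sload, sclear):
    """4:1 registered multiplexer in one LE when the fourth input is constant 0."""
    return register_next(lut_mux2(d0, d1, sel), sload, d2, sclear)


if __name__ == "__main__":
    print(registered_mux3(1, 0, 0, sel=0, sload=0))                       # selects d0
    print(registered_mux3(0, 0, 1, sel=0, sload=1))                       # selects d2
    print(registered_mux4_with_zero(1, 1, 1, sel=1, sload=1, sclear=1))   # selects 0
```

Since sload and sclear are shared across a LAB, this packing only applies when neighbouring bits use the same control signals, as the bits of one bus normally do.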
The Cost of Multiplexers (LEs per bit)
               2:1   3:1   4:1   5:1   6:1
Asynchronous    1     2     2     3     4
Registered      1     1    1-2    3     3
The Most Cost Effective Multiplexers
[Figure: the same cost table with the most cost-effective multiplexer sizes picked out]
Recap:
416 LEs (13 x 32 bits), down from 544 LEs (17 x 32 bits)
-24%
Multiplexer costs were reduced using a register!
Optimizing Multiplexers in Nios II
Restructuring Techniques
Underutilized Muxes
[Figure: pipeline datapath with per-multiplexer LE costs]
Registered cost per bit: 2:1 = 1, 3:1 = 1
Can extend 2:1 to be a 3:1 at no extra cost!
Input Balancing:
[Figure: pipeline datapath with the 2:1 and 3:1 multiplexers highlighted]
Registered cost per bit: 2:1 = 1, 3:1 = 1
Async cost per bit: 2:1 = 1, 3:1 = 2
NIOS II Multiplexers
[Figure: pipeline datapath after input balancing; the 2:1 and 3:1 multiplexers have exchanged positions]
384 LEs (12 x 32 bits), down from 416 LEs (13 x 32 bits)
-8%
Related Inputs:
[Figure: pipeline datapath around the multiplier (*), with 2:1 multiplexers and 4-LUT / 5-LUT annotations]
352 LEs (11 x 32 bits)
Design Trade-offs
[Figure: pipeline datapath; the 4:1 result multiplexer becomes a 3:1]
CALL, TRAP, INTR, BREAK: 3 cycles each
No need to Forward Return Address Early
Forwarding Zero…
[Figure: pipeline datapath; the 3:1 result multiplexer becomes a 2:1 and its cost per bit drops from 2 to 1]
CMP.op r3, r2, r1
IF (r2 op r1) THEN R3 = 0x00000001 ELSE R3 = 0x00000000
Mostly 0’s
Can use Synchronous Reset instead of multiplexer input.
Optimizing Multiplexers in FPGA
Summary
Summary: Restructure to 4:1 or 3:1(reg)
Optimal Multiplexer Densities (LEs per bit)
               2:1   3:1   4:1   5:1   6:1
Asynchronous    1     2     2     3     4
Registered      1     1    1-2    3     3
Summary
[Figure: final pipeline datapath with multiplexers 5:1, 6:1, 3:1, 2:1, 3:1, 2:1]
320 LEs (10 x 32 bits), down from 544 LEs (17 x 32 bits)
- 42%
Techniques Extend to Real Designs…
          Original              Optimized
Design    Size      Speed       Size       Speed
A         2,400     40 MHz      -50%       2.5x
B         7,373     77 MHz      -77%       2.0x
C         13,500    50 MHz      1c12 fit   1.5x
D         13,472    67 MHz      -60%       unchanged
E         1,925     75 MHz      -27%       unchanged
Others    …         …           …          …
Optimizing Multiplexers in FPGA
Support in Quartus Synthesis
New Multiplexer Report:
(Table is always produced after Analysis & Synthesis, even if optimizations are disabled)
– Number of Unique (or Constant) Inputs
– Number of busses with identical structure
– Estimate of Area Inefficiency
New Synthesis Option:
[Screenshot]
Results: (Stratix I: Logic Reduction)
[Figure: Stratix I QOR Set, LEs Post Synthesis. Bar chart of percentage reduction (axis from -10% to 25%) for each design in the benchmark set]
Mean = 4.2% (geo) (preliminary)
Over 20% Area Reduction in Benchmark Set!
Summary
System Design on FPGAs
– Low cost easy-to-use tools with Time-to-Market advantage
Architecting Designs for FPGAs
– Multiplexer Costs can dominate in FPGAs
  • 25% of the area on average
  • Significant in Processor / Busses
– FPGA Multiplexer Costs do not scale linearly
  • best to map to 4:1 or 3:1(reg)
  • Registers can reduce multiplexer costs!
– The Cheapest Multiplexers are those not implemented in Logic!
  • Eg: By using a multiplier
Synthesis Tools assist in Optimization Process
– But the Designer still has a huge influence on QoR
The End.
Questions?