Device and Architecture Co-Optimization for FPGA Power Reduction
Lerong Cheng, Phoebe Wong,
Fei Li, Yan Lin,
and Prof. Lei He
EE Department, UCLA
Partially supported by NSF CAREER award CCR-0093273/0401682 and NSF grant CCR-0306682.
Address comments to [email protected]
Outline
Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
Conclusion
Evaluation of Conventional FPGA Architecture
LUT size and cluster size have been evaluated for conventional FPGAs:
  Performance and area [Ahmed et al, ISFPGA’00]
  Power and performance [Li et al, ISFPGA’03]
Architecture tuning leads to a 2.8X energy difference and a 1.5X delay difference.
[Figure: island-style FPGA architecture, showing logic blocks, I/O pads, switch boxes, and connection boxes.]
[Figure: evaluation result. Total FPGA energy (nJ/cycle) versus critical path delay (ns), one point per architecture, labeled with pairs ranging from (6, 3) to (12, 7).]
Evaluation of Low-Power FPGA Architecture
Field-programmable dual Vdd for power reduction [Lin et al, ISFPGA’05]
Applying field-programmable dual Vdd reduces the energy-delay product by 49%.
[Figure: a conventional FPGA compared with a Vdd-programmable FPGA, with high-Vdd logic blocks, low-Vdd logic blocks, and Vdd-programmable logic blocks.]
Evaluation Methodology
[Figure: evaluation flow. Benchmark circuits pass through logic optimization (SIS), technology mapping (RASP), timing-driven packing (T-VPack), and placement and routing (VPR) under an architecture specification (Arch Spec), yielding delay and area. Parasitic extraction feeds the cycle-accurate power simulator (Psim), which reports power.]
Impact of Device Tuning
Previous work considers only architecture tuning.
Device tuning leads to an 84X power difference and a 12X delay difference.
Device tuning and architecture tuning therefore need to be performed simultaneously.
Challenge of Device and Architecture Co-Optimization
We consider the following architecture and device parameters during co-optimization:
  Architecture parameters: cluster size (N) and LUT size (K)
  Device parameters: supply voltage (Vdd) and threshold voltage (Vt)
A hyper-architecture (hyper-arch) is a combination of device and architecture parameters.
The number of hyper-arch combinations is large.
VPR and Psim are too slow to handle such a large number of experiments.
Fast yet accurate power and delay estimation is needed.
Outline
Background and motivation
Trace-based power and delay estimation
  Trace collection
  Trace-based power and delay model
  Accuracy and efficiency verification of the trace-based estimator
Device and architecture co-optimization
Conclusion
Trace Collection
[Figure: VPR and Psim generate the trace used by Ptrace, along with area.]
The trace contains:
  Short circuit power ratio
  Circuit element statistics
  Switching activity
  Critical path structure
Assume that the trace information remains the same when the device setting changes.
Trace-Based Estimation (Ptrace) Framework
[Figure: Ptrace combines the trace (device independent) with circuit-level delay and power (device dependent) to produce chip-level delay, power, and area.]
Delay Model in VPR
Delay is calculated for each path as
$D = \sum_i N_i^p D_i$
where $N_i^p$ is the number of type-$i$ elements in the path and $D_i$ is the delay of a type-$i$ element.
The delay of the logic elements is measured by SPICE simulation.
Elmore delay is used for interconnect wire segments.
The critical path is the path with the longest delay.
Delay in Ptrace
Obtain the path structure of a set of the longest circuit paths.
Assume that when the device setting changes, the new critical path is still among this set of longest paths.
Delay computation:
$D = \sum_i N_i^p D_i$
Here $N_i^p$ is trace information and $D_i$ is a device-dependent parameter.
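The Ptrace delay computation (a per-path sum of element counts times element delays, maximized over the stored set of longest paths) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the element types, and all numbers are hypothetical.

```python
from typing import Dict, List


def path_delay(counts: Dict[str, int], element_delay: Dict[str, float]) -> float:
    """Delay of one path: sum over element types of (count in path) * (element delay)."""
    return sum(n * element_delay[kind] for kind, n in counts.items())


def critical_path_delay(paths: List[Dict[str, int]],
                        element_delay: Dict[str, float]) -> float:
    """Longest delay among the candidate paths stored in the trace."""
    return max(path_delay(p, element_delay) for p in paths)


# Hypothetical trace: element counts for two of the longest paths.
trace_paths = [
    {"lut": 5, "wire_segment": 12, "switch_buffer": 9},
    {"lut": 7, "wire_segment": 8, "switch_buffer": 6},
]

# Hypothetical per-element delays (ns) under two device settings.
delays_a = {"lut": 0.40, "wire_segment": 0.10, "switch_buffer": 0.20}
delays_b = {"lut": 0.60, "wire_segment": 0.05, "switch_buffer": 0.10}

print(critical_path_delay(trace_paths, delays_a))  # first path is critical here
print(critical_path_delay(trace_paths, delays_b))  # second path becomes critical
```

With these made-up numbers the critical path switches from the first stored path to the second when the device setting changes, which is why a set of longest paths is kept rather than a single one.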
Dynamic Power Model
Psim
  Switch power: $P_{sw} = \sum_{i=1}^{n} \frac{1}{2} f V_{dd}^2 C_i S_i$
    Switching activity is measured by timing simulation for each node.
    $S_i$ is the average switching activity.
  Short circuit power: $P_{sc} = P_{sw} \cdot \alpha_{sc}(t_r)$
    $\alpha_{sc}$ is calculated for each node.
Ptrace
  Switch power: $P_{sw} = \sum_{i=1}^{n} \frac{1}{2} f V_{dd}^2 C_i S_i$
  Short circuit power: $P_{sc} = P_{sw} \cdot \alpha_{sc}$
    $\alpha_{sc}$ is the average short circuit power ratio for the whole circuit.
The Ptrace formulas combine trace information with device-dependent parameters.
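As a worked illustration of the dynamic power model, the sketch below evaluates the switch power sum and then scales it by a single circuit-wide short circuit ratio, as in the Ptrace variant. It assumes the short circuit power is the product of switch power and the ratio; the function names and all numeric values are hypothetical.

```python
from typing import Sequence


def switch_power(f_hz: float, vdd: float,
                 caps_f: Sequence[float], activities: Sequence[float]) -> float:
    """P_sw = sum_i 0.5 * f * Vdd^2 * C_i * S_i, in watts."""
    return sum(0.5 * f_hz * vdd ** 2 * c * s for c, s in zip(caps_f, activities))


def short_circuit_power(p_sw: float, alpha_sc: float) -> float:
    """P_sc = P_sw * alpha_sc, with alpha_sc an average ratio for the whole circuit."""
    return p_sw * alpha_sc


# Hypothetical values: 100 MHz clock, 0.9 V supply, two nodes.
p_sw = switch_power(100e6, 0.9, caps_f=[1e-12, 2e-12], activities=[0.5, 0.25])
p_sc = short_circuit_power(p_sw, alpha_sc=0.1)
print(p_sw, p_sc)
```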
Static Power Model
Psim
  Without power gating: $P_{static} = \sum_i N_i^t P_i$
  With power gating: $P_{static} = \sum_i N_i^u P_i + \alpha_{gating} \sum_i (N_i^t - N_i^u) P_i$
Ptrace
  Without power gating: $P_{static} = \sum_i N_i^t P_i$
  With power gating: $P_{static} = \sum_i N_i^u P_i + \alpha_{gating} \sum_i (N_i^t - N_i^u) P_i$
The Ptrace formulas combine trace information with device-dependent parameters.
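The static power model can be illustrated with a short sketch. It rests on assumptions that the slide does not spell out: that the two element counts are the total and the used number of elements of each type, that unused elements are the ones power-gated, and that gating scales their leakage by a single factor (called `alpha_gating` here). The function name and all numbers are hypothetical.

```python
from typing import Dict, Optional


def static_power(total: Dict[str, int], used: Dict[str, int],
                 leak_w: Dict[str, float],
                 alpha_gating: Optional[float] = None) -> float:
    """Static power in watts.

    Without gating (alpha_gating is None): every element leaks at leak_w.
    With gating (assumed model): used elements leak fully; unused elements
    leak at alpha_gating * leak_w.
    """
    if alpha_gating is None:
        return sum(total[k] * leak_w[k] for k in total)
    active = sum(used[k] * leak_w[k] for k in total)
    gated = alpha_gating * sum((total[k] - used[k]) * leak_w[k] for k in total)
    return active + gated


# Hypothetical element counts and per-element leakage (W).
total = {"lut": 100, "switch_buffer": 400}
used = {"lut": 60, "switch_buffer": 100}
leak_w = {"lut": 2e-6, "switch_buffer": 1e-6}

print(static_power(total, used, leak_w))                     # no power gating
print(static_power(total, used, leak_w, alpha_gating=0.01))  # with power gating
```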
Experiment Setting
Collect the trace using ITRS 70nm technology, but apply it to both 100nm and 70nm technologies.
20 MCNC benchmarks
Assume each benchmark operates at its highest possible frequency.
Power and delay are computed as the geometric mean over the 20 benchmarks.
Evaluation range:
  Vdd (V)           0.8~1.1
  Vt (V)            0.2~0.4
  LUT size (K)      3~7
  Cluster size (N)  6~12
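Aggregating per-benchmark results with a geometric mean, as the experiment setting describes, can be sketched with the Python standard library. The delay values below are hypothetical placeholders, not measurements from the 20 MCNC benchmarks.

```python
from statistics import geometric_mean, mean

# Hypothetical per-benchmark critical path delays (ns); not data from the slides.
delays_ns = [10.0, 40.0, 20.0]

geo = geometric_mean(delays_ns)  # cube root of 10 * 40 * 20
arith = mean(delays_ns)

print(f"geometric mean:  {geo:.2f} ns")
print(f"arithmetic mean: {arith:.2f} ns")
```

The geometric mean is less sensitive than the arithmetic mean to a single benchmark with a very large value, which is a common reason for using it across benchmark suites.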
Accuracy
Average power error is 3.4%.
Average delay error is 6.4%.
The delay error arises because Ptrace ignores the impact of path branches, which VPR considers.
Runtime
VPR and Psim for one device setting: five days on eight 1.2GHz Intel Xeon servers
Ptrace for 20 device settings: 80 seconds on one 1.2GHz Intel Xeon server
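The runtime figures can be put on a common footing as server-seconds per device setting. The arithmetic below is only a rough comparison under two assumptions: that the five days are wall-clock time with all eight servers busy, and that the one-time cost of collecting the trace is left out of the Ptrace figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the Runtime slide.
full_flow_server_seconds = 5 * SECONDS_PER_DAY * 8  # five days on eight servers, one device setting
ptrace_server_seconds = 80 / 20                     # 80 s on one server for 20 device settings

ratio = full_flow_server_seconds / ptrace_server_seconds

print(f"VPR + Psim: {full_flow_server_seconds:,} server-seconds per device setting")
print(f"Ptrace:     {ptrace_server_seconds:g} server-seconds per device setting")
print(f"Ratio:      {ratio:,.0f}x")
```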
Outline
Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
  Energy and delay tradeoff
  ED and area tradeoff
  Comparison between classes
  Comparison between device tuning and architecture tuning
Conclusion
Architecture Classes to Be Evaluated
Hyper-architecture classes:
  Hyper-arch class  Vt
  Homo-Vt           Homogeneous Vt
  Hetero-Vt         Heterogeneous Vt
  Homo-Vt+G         Homogeneous Vt + Power Gating
  Hetero-Vt+G       Heterogeneous Vt + Power Gating
Baseline case:
  Vdd suggested by ITRS
  Architecture same as Xilinx Virtex-II™
  Vt optimized by our method with respect to the above architecture and Vdd
  Vdd  Vt   LUT size (K)  Cluster size (N)
  0.9  0.3  4             8
Energy and Delay Tradeoff
Dominant hyper-arch:
  Hyper-arch B is inferior to A if A has less energy and smaller delay than B.
  Dominant hyper-archs (dom-archs) are the hyper-archs that are NOT inferior to any other hyper-arch.
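The definition of dominant hyper-archs is a Pareto filter over (energy, delay) points and can be sketched directly. The class name, function names, and sample points below are hypothetical; only the inferiority rule is taken from the definition above.

```python
from typing import List, NamedTuple


class HyperArch(NamedTuple):
    name: str
    energy_nj: float
    delay_ns: float


def is_inferior(b: HyperArch, a: HyperArch) -> bool:
    """B is inferior to A if A has less energy and smaller delay than B."""
    return a.energy_nj < b.energy_nj and a.delay_ns < b.delay_ns


def dominant_hyper_archs(archs: List[HyperArch]) -> List[HyperArch]:
    """Keep the hyper-archs that are not inferior to any other hyper-arch."""
    return [b for b in archs if not any(is_inferior(b, a) for a in archs)]


# Hypothetical candidates (energy in nJ, delay in ns).
candidates = [
    HyperArch("h1", 1.0, 20.0),
    HyperArch("h2", 1.5, 15.0),
    HyperArch("h3", 1.6, 21.0),  # inferior to both h1 and h2
    HyperArch("h4", 2.5, 10.0),
]

dom = dominant_hyper_archs(candidates)
print([h.name for h in dom])
```

This quadratic scan is adequate for a few thousand candidates; sorting by delay and sweeping for the running minimum energy would be the usual choice for larger sets.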
[Figure: energy per cycle (nJ) versus delay (ns) for the Homo-Vt, Hetero-Vt, Homo-Vt+G, and Hetero-Vt+G classes.]
Hetero-Vt can reduce power
Power gating reduces more leakage power than hetero-Vt
Hetero-Vt has less impact when power gating is applied
Min-ED Hyper-Arch
Hyper-arch class  Vdd (V)  CVt (V)  IVt (V)  (N, K)   ED (nJ·ns)  ED reduction (%)
Baseline          0.9      0.3      0.3      (8, 4)   26.9        -
Homo-Vt           0.9      0.3      0.3      (6, 7)   23.3        13.4
Hetero-Vt         0.9      0.2      0.25     (8, 4)   21.4        20.5
Homo-Vt+G         0.9      0.25     0.25     (12, 4)  11.1        58.9
Hetero-Vt+G       0.9      0.2      0.25     (8, 4)   11          59.0
To achieve the best energy and delay tradeoff, we find the hyper-arch with the minimum energy-delay product (ED).
Compared to the baseline, the min-ED hyper-arch of the conventional FPGA (Homo-Vt) reduces ED by 13.4%.
For the Hetero-Vt class, ED is reduced by 20.5%.
If power gating is applied, ED can be reduced by up to 59.0%.
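The ED reduction figures follow from the ED values of each class relative to the baseline. The sketch below recomputes them from the rounded ED values in the table, so the last digit can differ slightly from the slide; the variable and function names are hypothetical.

```python
BASELINE_ED = 26.9  # nJ*ns, baseline row of the min-ED table

# Min-ED value (nJ*ns) for each hyper-arch class, copied from the table.
min_ed = {
    "Homo-Vt": 23.3,
    "Hetero-Vt": 21.4,
    "Homo-Vt+G": 11.1,
    "Hetero-Vt+G": 11.0,
}


def ed_reduction_percent(ed: float, baseline: float = BASELINE_ED) -> float:
    """Relative ED reduction with respect to the baseline, in percent."""
    return (baseline - ed) / baseline * 100.0


for name, ed in min_ed.items():
    print(f"{name:12s} ED = {ed:5.1f}  reduction = {ed_reduction_percent(ed):.1f}%")

best_class = min(min_ed, key=min_ed.get)
print("lowest ED:", best_class)
```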
ED and Area Tradeoff
Architecture tuning has great impact on area.
To achieve the best area and ED tradeoff, we find the hyper-arch with the minimum product of area, energy and delay (AED)
ED Area Tradeoff for Classes without Power Gating
Compared to the min-ED hyper-arch, the min-AED hyper-arch significantly reduces area with a small ED increase.
[Figure: normalized area versus normalized ED for the Homo-Vt and Hetero-Vt classes, with the min-AED hyper-arch marked for Class1 and Class2.]
Labeled points, given as {Vdd, CVt, IVt, N, K}:
  A1-1: {0.9, 0.3, 0.3, 6, 7}
  A1-2: {1.0, 0.3, 0.3, 6, 4}
  A1-3: {0.9, 0.3, 0.3, 12, 4}
  A2-1: {0.9, 0.3, 0.25, 8, 5}
  A2-2: {0.9, 0.3, 0.25, 12, 4}
Sleep Transistor Size Tuning
When power gating is applied, sleep transistors may increase area.
The larger the sleep transistor, the smaller the delay.
Sleep transistor size tuning:
  The area overhead introduced by the sleep transistors of logic blocks is negligible.
  We consider 2X, 4X, 7X, and 10X PMOS as the sleep transistor for the switch buffer.
ED Area Tradeoff for Classes with Power Gating
The area reduction achieved by device and architecture co-optimization compensates for the area overhead introduced by sleep transistors.
[Figure: normalized area versus normalized ED for the Homo-Vt+G and Hetero-Vt+G classes, with the min-AED hyper-arch marked for Class3 and Class4.]
Labeled points, given as {Vdd, CVt, IVt, N, K, sleep transistor size}:
  A3-1: {0.9, 0.25, 0.25, 12, 4, G2}
  A3-2: {0.9, 0.25, 0.25, 12, 4, G4}
  A3-3: {0.9, 0.25, 0.25, 12, 4, G7}
  A3-4: {0.9, 0.25, 0.25, 12, 4, G10}
  A4-1: {0.9, 0.2, 0.25, 12, 4, G2}
  A4-2: {0.9, 0.2, 0.25, 12, 4, G4}
  A4-3: {0.9, 0.2, 0.25, 10, 4, G4}
  A4-4: {0.9, 0.2, 0.25, 6, 4, G7}
  A4-5: {0.9, 0.2, 0.25, 12, 4, G7}
  A4-6: {0.9, 0.2, 0.25, 8, 4, G10}
Min-AED Hyper-Arch
Hyper-arch class  Vdd (V)  CVt (V)  IVt (V)  (N, K)   Sleep transistor size  ED (nJ·ns)  Normalized area  AED reduction (%)
Baseline          0.9      0.30     0.30     (8, 4)   -                      26.9        1.00             -
Homo-Vt           1.0      0.30     0.30     (6, 4)   -                      23.6        0.80             30.0
Hetero-Vt         0.9      0.30     0.25     (12, 4)  -                      21.3        0.77             40.0
Homo-Vt+G         0.9      0.25     0.25     (12, 4)  2                      12.4        0.92             57.6
Hetero-Vt+G       0.9      0.20     0.25     (12, 4)  2                      12.2        0.92             58.3
Compared to the baseline, the min-AED hyper-arch in the conventional FPGA class can reduce area by 20% and ED by 12.3%
In the Hetero-Vt class, ED is reduced by 20.8% and area is reduced by 23% compared to the baseline
If power gating is applied, ED is reduced by 54.6% and area is reduced by 8.3%
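The AED comparison can be reproduced as the product of ED and normalized area. The sketch below uses the rounded table entries, so recomputed reductions can differ slightly from the AED reduction column. It also assumes that the fourth data row of the table (equal CVt and IVt of 0.25) belongs to the Homo-Vt+G class; all names are hypothetical.

```python
# (ED in nJ*ns, normalized area) per class, copied from the min-AED table.
rows = {
    "Baseline": (26.9, 1.00),
    "Homo-Vt": (23.6, 0.80),
    "Hetero-Vt": (21.3, 0.77),
    "Homo-Vt+G": (12.4, 0.92),   # assumed class label for the fourth data row
    "Hetero-Vt+G": (12.2, 0.92),
}


def aed(ed: float, area: float) -> float:
    """Area-energy-delay product, with area normalized to the baseline."""
    return ed * area


baseline_aed = aed(*rows["Baseline"])
aed_reduction = {
    name: (baseline_aed - aed(ed, area)) / baseline_aed * 100.0
    for name, (ed, area) in rows.items()
}

for name, pct in aed_reduction.items():
    print(f"{name:12s} AED reduction = {pct:.1f}%")

best_class = min(rows, key=lambda name: aed(*rows[name]))
print("lowest AED:", best_class)
```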
Comparison Between Classes in Similar Performance Range
Homo-Vt
  Vdd  Vt    (N, K)   E (nJ)  D (ns)  ED (nJ·ns)
  0.9  0.30  (6, 6)   1.33    18.6    24.8
  0.9  0.3   (10, 5)  1.27    19.8    25
  0.9  0.3   (6, 4)   1.23    21.6    26.5
Hetero-Vt
  Vdd  CVt  IVt   (N, K)   E (nJ)  D (ns)  ED (nJ·ns)
  0.9  0.3  0.35  (6, 4)   1.16    20.1    23.3
  0.9  0.3  0.35  (12, 4)  1.14    20.5    23.7
  0.9  0.3  0.35  (8, 4)   1.09    22.1    24.1
Homo-Vt+G
  Vdd  Vt    (N, K)   E (nJ)  D (ns)  ED (nJ·ns)
  0.8  0.25  (10, 5)  0.70    19.4    13.7
  0.8  0.25  (8, 4)   0.62    20.9    12.9
  0.8  0.25  (12, 4)  0.62    21      12.9
Hetero-Vt+G
  Vdd  CVt   IVt   (N, K)   E (nJ)  D (ns)  ED (nJ·ns)
  0.9  0.25  0.3   (12, 4)  0.66    18.9    12.5
  0.8  0.25  0.25  (8, 4)   0.62    20.9    12.9
  0.8  0.25  0.25  (12, 4)  0.62    21      12.9
Vt for logic block is lower than Vt for interconnect
Vt for classes with power gating is lower
Dom-Archs under Different Device Settings
[Figure: energy per cycle (nJ) versus delay (ns) for the dom-archs under device settings D1 to D8.]
Device settings:
  D1: Vdd 0.9, Vt 0.25
  D2: Vdd 0.9, Vt 0.30
  D3: Vdd 0.9, Vt 0.35
  D4: Vdd 1.0, Vt 0.25
  D5: Vdd 1.0, Vt 0.30
  D6: Vdd 1.0, Vt 0.35
  D7: Vdd 1.1, Vt 0.30
  D8: Vdd 1.1, Vt 0.35
For a given device setting, architecture tuning changes delay and energy within a smaller range.
Device tuning has a much greater impact on delay and energy.
Conclusion and Discussion
The trace-based estimator provides efficient and accurate FPGA power and delay estimation.
  Average power error is 3.4% and average delay error is 6.1%.
Device and architecture co-optimization reduces ED by 20.5% and area by 23.3% when there is no power gating
With power gating, device and architecture co-optimization reduces ED by 54.6% and area by 8.3%
Device tuning has a more significant impact on delay and power than architecture tuning does
In recent research, Ptrace has been extended to consider leakage and timing yield with process variations