Power Optimal Dual-Vdd Buffered Tree Considering Buffer Stations and Blockages
King Ho Tam and Lei He
Electrical Engineering Department
University of California, Los Angeles
Sponsors: NSF CAREER, UC MICRO (Fujitsu, Intel and Mindspeed), and IBM Faculty Partner Award.
Motivation
Increasing interconnect power
  35% of cells are buffers at 65nm technology [Saxena, TCAD 04]
Previous work
  Power-optimal single Vdd buffer insertion [Lillis, JSSC 96]
  Delay-optimal buffered tree generation [Cong, DAC 00; Alpert, TCAD 02]
No existing algorithm considers dual Vdd for buffer insertion or buffered tree generation
Major Contributions
First in-depth study of dual Vdd buffer insertion and buffered tree generation
  Large power saving over single Vdd buffering
Efficient algorithms for power optimality
  17x faster than [Lillis, JSSC 96] when a single Vdd is considered
Outline
Dual Vdd buffer insertion and sizing (DVB)
  Problem formulation
  Sampling for speedup
  Experimental results
Dual Vdd buffered tree generation (D-Tree)
  Problem formulation
  Improved augmented orthogonal search tree
  Experimental results
Delay, Slew and Power Modeling
Elmore delay
  Wire: $d_w = \left(\tfrac{1}{2} c_w l + c_{load}\right) r_w l$
  Buffer: $d_{buf} = d_{int} + r_o c_{load}$
Bakoglu’s slew metric ($\ln 9 \cdot$ Elmore delay)
Power = energy per switch
  Wire: $E_w = 0.5\, c_w\, l\, V_{dd}^2$
  Lumped buffer dynamic/short-circuit power
  Can be easily extended to leakage power
    Low Vdd (VL) reduces leakage
    Need to assume a clock rate and switching activity
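The models on this slide can be written down as a short Python sketch. It reads the wire delay as (0.5·c_w·l + c_load)·r_w·l, the buffer delay as d_int + r_o·c_load, and the wire energy as 0.5·c_w·l·Vdd². The function names and the 1.2 V / 0.9 V supply values are illustrative assumptions, not values taken from the slides.

```python
import math


def wire_delay(r_w, c_w, length, c_load):
    """Elmore delay of a wire of the given length driving c_load."""
    return (0.5 * c_w * length + c_load) * r_w * length


def buffer_delay(d_int, r_o, c_load):
    """Buffer delay: intrinsic delay plus output resistance times load."""
    return d_int + r_o * c_load


def slew(elmore_delay):
    """Bakoglu's metric: slew is ln(9) times the Elmore delay."""
    return math.log(9.0) * elmore_delay


def wire_energy(c_w, length, vdd):
    """Energy per switch of a wire: 0.5 * c_w * length * vdd^2."""
    return 0.5 * c_w * length * vdd ** 2


# Hypothetical supply levels, chosen only to show the quadratic dependence.
VH, VL = 1.2, 0.9
ratio = wire_energy(1.0, 1.0, VL) / wire_energy(1.0, 1.0, VH)
print(f"VL/VH wire energy ratio: {ratio:.4f}")
```

With these assumed supplies the ratio is (0.9/1.2)², so the same wire switched at VL would use a little over half the energy it uses at VH.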
Introducing Dual Vdd Buffering
Achieves power saving since power ∝ Vdd^2
Suffers no loss of delay optimality
VL => VH requires a level converter (LC)
  Restores the voltage level and reduces leakage
  Ext-CVS for logic [Srivastava, ISLPED 04]: LC delay and power overhead amortized
[Figure: VL-to-VH interface, labeled with reduced noise margin and leakage current I]
Key Observation in Dual Vdd Buffering
Disallowing VL => VH does not affect optimality
Optimality illustrated empirically (at 65nm):
  (a) has an LC and a VH buffer driving Cl, so power (a) > power (b)
  Delay (b) > delay (a) only if Cl > 0.5pF (~9mm of wire)
[Figure: configurations (a) and (b), labeled VH and VL]
DVB Formulation
Dual Vdd Buffer Insertion (DVB)
Given an interconnect tree
Find buffer placement, Vdd assignment for buffers, and sizes of buffers
  VH buffers driving VL buffers within the tree
  Level converters at VH sinks driven by VL buffers
Minimize power subject to
  Required arrival time (RAT) constraint at the source
  Slew rate constraint at buffer inputs and sinks
DVB Algorithm
Based on [Lillis, JSSC 96]
  Dynamic programming with partial solution (option) pruning
Options must now record downstream Vdd levels for buffering
  This prevents VL => VH, which removes unnecessary search of the solution space
Still quite slow for large nets
Challenge
  Considering power causes super-linear growth in the number of options (w.r.t. tree size)
  Dual Vdd buffers => 2x options at each node
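The option bookkeeping described on this slide can be illustrated with a minimal Python sketch. The `Option` fields, the `has_vh` flag (standing in for the recorded downstream Vdd level) and the rule of comparing only options with the same flag are assumptions made for illustration, not the authors' exact data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    cap: float     # downstream capacitance seen at this node
    power: float   # power of the downstream partial solution
    rat: float     # required arrival time at this node
    has_vh: bool   # assumed flag: a VH buffer exists downstream


def dominates(a, b):
    """a makes b redundant: no worse in cap, power and RAT, same Vdd flag."""
    return (a.has_vh == b.has_vh and a.cap <= b.cap
            and a.power <= b.power and a.rat >= b.rat)


def prune(options):
    """Keep only the options that no other option dominates."""
    kept = []
    for o in options:
        if any(dominates(k, o) for k in kept):
            continue
        kept = [k for k in kept if not dominates(o, k)]
        kept.append(o)
    return kept


def can_insert(buffer_is_vh, option):
    """Disallow VL => VH: a VL buffer may not drive a VH buffer downstream."""
    return buffer_is_vh or not option.has_vh


candidates = [
    Option(10, 100, 500, False),
    Option(12, 110, 480, False),   # dominated by the first option
    Option(12, 110, 480, True),    # kept: different Vdd flag
    Option(8, 120, 520, False),    # kept: better cap and RAT, more power
]
survivors = prune(candidates)
```

Keeping the flag in each option is what lets the search skip VL buffers above a VH subtree without any later repair step.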
Speed-up Technique
Approximate by power-delay sampling
  Sampling under each distinct cap value
  Uniformly pick options from the entire RAT-power trade-off curve
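One plausible reading of power-delay sampling is sketched below in Python: for each distinct capacitance value, a grid is laid over the power-RAT plane and one option is kept per occupied cell. Keeping the best-RAT option in each cell, the tuple layout and the function name are assumptions, not details confirmed by the slides.

```python
from collections import defaultdict


def sample_options(options, grid=20):
    """options: iterable of (cap, power, rat) tuples.

    For each distinct cap, overlay a grid x grid mesh on the power-RAT
    plane and keep the option with the largest RAT in each occupied cell.
    """
    by_cap = defaultdict(list)
    for cap, power, rat in options:
        by_cap[cap].append((cap, power, rat))

    def cell(value, lo, hi):
        # Map a value in [lo, hi] to a cell index in [0, grid - 1].
        if hi == lo:
            return 0
        return min(int((value - lo) / (hi - lo) * grid), grid - 1)

    sampled = []
    for group in by_cap.values():
        p_lo = min(o[1] for o in group)
        p_hi = max(o[1] for o in group)
        q_lo = min(o[2] for o in group)
        q_hi = max(o[2] for o in group)
        best = {}
        for o in group:
            key = (cell(o[1], p_lo, p_hi), cell(o[2], q_lo, q_hi))
            if key not in best or o[2] > best[key][2]:
                best[key] = o
        sampled.extend(best.values())
    return sampled


# 1000 synthetic options that share one capacitance value.
synthetic = [(1.0, float(i), float(i % 37)) for i in range(1000)]
picked = sample_options(synthetic, grid=4)
```

The number of options kept per capacitance value is bounded by the number of grid cells, which is what caps the option growth during propagation.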
Experimental Settings for DVB
Testcases: randomly generated Steiner trees
  20 to 800 terminals in a 1cm x 1cm routing area
  Buffer sizes: 16x, 32x, 64x
Sampling grid set to 20x20
Comparison
  Exact power-optimal algorithm (PB) [Lillis, JSSC 96]
  Our algorithm with single (SVB) and dual (DVB) Vdd buffers
Sampling Preserves Optimality
Sampling has little impact on optimality
  SVB follows PB closely
  Still optimal delay, with 1.7% more power than PB
Dual Vdd Reduces Power
Dual Vdd shifts power-delay curve to the left
Experimental Results for DVB
DVB saves 23% power over SVB
  More power saving in larger nets
  Power saving becomes larger with delay slack, e.g. relaxing delay by 5% raises the saving to 26%

Power at optimal RAT (fJ)
Net   # nodes   # sinks   SVB     DVB
S5    375       199       18699   13808 [-26%]
S6    515       299       23443   17239 [-26%]
S7    784       499       33552   23804 [-29%]
S8    1054      699       38351   25799 [-33%]
S9    1188      799       40228   26646 [-34%]
avg                               [-23%]
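The bracketed percentages for S5 to S9 can be recomputed from the SVB and DVB power values. The short Python check below does only that; the reported average cannot be recomputed from the five nets shown here, so the sketch does not attempt it.

```python
# (SVB power, DVB power) in fJ, copied from the results for nets S5 to S9.
rows = {
    "S5": (18699, 13808),
    "S6": (23443, 17239),
    "S7": (33552, 23804),
    "S8": (38351, 25799),
    "S9": (40228, 26646),
}


def saving_percent(svb, dvb):
    """Relative power change of DVB versus SVB, rounded to a whole percent."""
    return round(100.0 * (dvb - svb) / svb)


savings = {net: saving_percent(svb, dvb) for net, (svb, dvb) in rows.items()}
for net, pct in savings.items():
    print(f"{net}: {pct}%")
```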
Runtime
SVB scales much better to larger testcases
  Achieved 17x speedup over PB [Lillis, JSSC 96]
DVB takes ~2.5x more runtime than SVB

Runtime (s)
Net   # nodes   # sinks   PB        SVB     DVB
S5    375       199       719       86      212
S6    515       299       2121      139     371
S7    784       499       33419     393     635
S8    1054      699       > 1 day   598     1072
S9    1188      799       > 1 day   853     1859
avg                       1x        1/17x   1/7x
Outline
Dual-Vdd buffer insertion and sizing (DVB)
  Problem formulation
  “Sampling” speed-up technique
  Experimental results
Dual-Vdd buffered tree generation (D-Tree)
  Problem formulation
  Improved augmented orthogonal search tree
  Experimental results
D-Tree Formulation
Dual Vdd Buffered Tree (D-Tree)
Given locations of terminals, buffer stations and blockages
Find a rectilinear Steiner tree (RST) and buffer placement/size/Vdd assignment
  VH buffers driving VL buffers only
  Level converters at VH sinks driven by VL buffers
Minimize power subject to
  Required arrival time (RAT) constraint at the source
  Slew rate constraint at buffer inputs and sinks
D-Tree is NP-hard
  Finding a minimum RST alone is NP-complete
Buffered Tree Construction
Delay optimization only [Cong, DAC 00], by
1. Building a Hanan graph with buffer insertion nodes according to the locations of buffer stations
2. Path search on the grid by option propagation
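The first step can be sketched in Python as below. The sketch builds the grid from the x and y coordinates of all terminals and buffer stations and marks the grid points that coincide with buffer stations as buffer insertion nodes. The function name and return layout are hypothetical, and blockage handling is left out.

```python
def hanan_grid(terminals, buffer_stations):
    """Build a Hanan grid over terminal and buffer station coordinates.

    Returns (nodes, edges, buffer_nodes). Edges join grid neighbours.
    Blockages are not modelled in this sketch.
    """
    points = list(terminals) + list(buffer_stations)
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})

    nodes = [(x, y) for x in xs for y in ys]
    edges = []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i + 1 < len(xs):   # horizontal edge to the right neighbour
                edges.append(((x, y), (xs[i + 1], y)))
            if j + 1 < len(ys):   # vertical edge to the upper neighbour
                edges.append(((x, y), (x, ys[j + 1])))

    buffer_nodes = set(buffer_stations)
    return nodes, edges, buffer_nodes


nodes, edges, buffer_nodes = hanan_grid(
    terminals=[(0, 0), (2, 3)],
    buffer_stations=[(1, 1)],
)
```

The path search of the second step would then propagate options along `edges`, considering buffer insertion only at nodes in `buffer_nodes`.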
D-Tree Algorithm Overview
Challenges
  Growth of options is exponential
    An artifact of D-Tree’s NP-hardness
  Considering power worsens option growth
Solution: sampling + efficient prune tree
Prune Tree in [Lillis, JSSC 96]
Options are inserted in sorted capacitance order
  Never need to clear options out of the tree
  If a new option is checked against the tree, redundant options are automatically avoided
  e.g. Φnew = (c = 20, p = 100, q = 600)
Not applicable to the D-Tree problem
  Order of new options is not known a priori
[Figure: prune tree for p = 100 holding options (c=20, q=600), (c=10, q=500), (c=8, q=400), (c=15, q=550), (c=12, q=520), (c=7, q=380)]
Our Improvement on Prune Tree
Indexing by capacitance results in fewer trees
  # capacitance values < # power values
Efficient “tree cleaning”
  Enables out-of-order option insertion
  Guarantees no redundancy in the tree
Tree Cleaning
To add an option Φnew in O(|c|·log(|T|)) time
1. Check whether Φnew is dominated by any option in the data structure
2. If not, remove the options in the tree dominated by Φnew in two downward tree traversals
   e.g. Φnew = (c = 10, p = 70, q = 410, …)
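The two steps above can be mimicked with a simplified Python sketch. It keeps one bucket per capacitance value and uses sorted lists with binary search as a stand-in for the search trees, so it shows the dominance logic and out-of-order insertion rather than the authors' tree structure. The class and method names are hypothetical.

```python
import bisect


class OptionStore:
    """Non-dominated (cap, power, rat) options, bucketed by capacitance.

    Within a bucket, powers and rats are parallel lists that are both
    strictly increasing: more power must buy a better RAT to be kept.
    """

    def __init__(self):
        self.buckets = {}   # cap -> (powers, rats)

    def _dominated(self, cap, power, rat):
        for c, (powers, rats) in self.buckets.items():
            if c > cap:
                continue
            i = bisect.bisect_right(powers, power) - 1   # last power <= new
            if i >= 0 and rats[i] >= rat:
                return True
        return False

    def insert(self, cap, power, rat):
        """Step 1: reject if dominated. Step 2: clean, then insert."""
        if self._dominated(cap, power, rat):
            return False
        for c, (powers, rats) in self.buckets.items():
            if c < cap:
                continue
            lo = bisect.bisect_left(powers, power)   # first power >= new
            hi = bisect.bisect_right(rats, rat)      # first rat > new
            if lo < hi:                              # dominated run [lo, hi)
                del powers[lo:hi]
                del rats[lo:hi]
        powers, rats = self.buckets.setdefault(cap, ([], []))
        i = bisect.bisect_left(powers, power)
        powers.insert(i, power)
        rats.insert(i, rat)
        return True

    def options(self):
        return sorted((c, p, q) for c, (ps, qs) in self.buckets.items()
                      for p, q in zip(ps, qs))


store = OptionStore()
first = store.insert(10, 100, 500)
second = store.insert(12, 110, 480)   # dominated by (10, 100, 500)
third = store.insert(12, 90, 450)
fourth = store.insert(8, 80, 520)     # arrives out of order, cleans two
```

Because each bucket is sorted in both power and RAT, the options dominated by a new one form a contiguous run, so each bucket needs only two binary searches per insertion.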
Experimental Settings for D-Tree
Random testcases
  All based on a random floorplan of 1cm x 1cm
  Blockages ~30%, buffer stations ~1mm apart
Comparison
  Delay-optimal tree (RMP) [Cong, DAC 00]
  Ours with single (S-Tree) and dual (D-Tree) Vdd buffers
Experimental Results for D-Tree
Significant power saving over RMP
  S-Tree: 7%, D-Tree: 18%
  Larger saving for large testcases (e.g. T4)
Handles up to 6-sink nets (T5 takes 23 mins)
  Similar capability compared with delay-optimal approaches [Cong, DAC 00; Chen, ASP-DAC 02]

Power at optimal RAT (pJ)
Net   # nodes   # sinks   RMP   S-Tree       D-Tree
T3    137       4         3.9   3.5 [-10%]   2.9 [-23%]
T4    261       5         4.9   4.4 [-13%]   3.1 [-37%]
T5    235       6         4.2   3.8 [-10%]   3.4 [-18%]
avg                             -7%          -18%
Conclusion
Formulated dual Vdd buffer insertion/tree generation without level converters
Proposed 2 speedup techniques
  “Sampling” with negligible loss of optimality
  “Improved prune tree” for solution pruning
  Applied to single-Vdd buffer insertion, 17x faster than existing work
Large power saving over single Vdd buffering
  23% in buffer insertion: dual Vdd vs single Vdd
  18% in buffered tree: dual Vdd vs delay optimal
Future Work
Speed up tree construction
Slack allocation for more power reduction
  Path-based buffer insertion [Sze, DAC 05]
    Allocates slack along one interconnect path
    Considers single Vdd buffers only
  Chip-level FPGA dual Vdd assignment [Lin, DAC 05]
    Fixed buffer locations, assigns Vdd levels
    Considers multiple critical paths
    Solved as a linear programming problem