Power Optimal Dual-Vdd Buffered Tree Considering Buffer Stations and Blockages

King Ho Tam and Lei He

Electrical Engineering Department, University of California, Los Angeles

Sponsors: NSF CAREER, UC MICRO (Fujitsu, Intel and Mindspeed), and IBM Faculty Partner Award.

Motivation

Increasing interconnect power: 35% of cells are buffers at 65nm technology [Saxena, TCAD 04]

Previous work

Power-optimal single-Vdd buffer insertion [Lillis, JSSC 96]

Delay-optimal buffered tree generation [Cong, DAC 00; Alpert, TCAD 02]

No existing algorithms consider dual-Vdd for buffer insertion or buffered tree generation

Major Contributions

First in-depth study of dual-Vdd buffer insertion and buffered tree generation: large power saving over single-Vdd buffering

Efficient algorithms for power optimality: 17x faster than [Lillis, JSSC 96] when only a single Vdd is considered

Outline

Dual Vdd buffer insertion and sizing (DVB): problem formulation, sampling for speedup, experimental results

Dual Vdd buffered tree generation (D-Tree): problem formulation, improved augmented orthogonal search tree, experimental results

Delay, Slew and Power Modeling

Elmore delay. Wire: $d_w = r_w l \left( \tfrac{1}{2} c_w l + c_{load} \right)$; buffer: $d_{buf} = d_{int} + r_o c_{load}$

Bakoglu's slew metric ($\ln 9 \cdot$ Elmore delay)

Power = energy per switch. Wire: $E_w = 0.5 \, c_w \, l \, V_{dd}^2$. Buffer: lumped dynamic/short-circuit power. Can be easily extended to leakage power

Low Vdd (VL) reduces leakage. Extending to leakage requires assuming a clock rate and switching activity
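The delay, slew and energy models on this slide can be written out as a few small functions. This is a minimal sketch: the function and parameter names are illustrative assumptions, and units are left to the caller (any consistent set works).

```python
import math

def wire_delay(r_w, c_w, length, c_load):
    """Elmore delay of a wire of the given length driving c_load.

    r_w and c_w are per-unit-length resistance and capacitance.
    """
    return r_w * length * (0.5 * c_w * length + c_load)

def buffer_delay(d_int, r_o, c_load):
    """Buffer delay: intrinsic delay plus output resistance times load."""
    return d_int + r_o * c_load

def slew(elmore_delay):
    """Bakoglu's slew metric: ln(9) times the Elmore delay."""
    return math.log(9) * elmore_delay

def wire_energy(c_w, length, vdd):
    """Energy per switch of a wire: 0.5 * c_w * l * Vdd^2."""
    return 0.5 * c_w * length * vdd ** 2
```

Because the wire energy depends on Vdd quadratically, the same wire is cheaper to switch when it sits downstream of a VL buffer.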

Introducing Dual Vdd Buffering

Achieves power saving since power ∝ Vdd^2

Suffers no loss of delay optimality

VL => VH requires level converter (LC) Restore voltage level and reduce leakage

Ext-CVS for logic [Srivastava, ISLPED 04] LC delay and power overhead amortized

[Figure: VL-to-VH interface without a level converter. Labels: VL, VH, reduced noise margin, leakage current I]
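To make the quadratic dependence of power on Vdd concrete, here is a worked ratio for one wire segment driven at VL instead of VH. The voltages $V_H = 1.2$ V and $V_L = 0.9$ V are assumed values chosen for illustration; the slides do not state the supply levels used.

```latex
\frac{E_w(V_L)}{E_w(V_H)}
  = \frac{0.5 \, c_w \, l \, V_L^2}{0.5 \, c_w \, l \, V_H^2}
  = \left( \frac{V_L}{V_H} \right)^2
  = \left( \frac{0.9}{1.2} \right)^2
  = 0.5625
```

Under these assumed voltages the segment's switching energy drops by about 44%, before any level converter overhead is counted.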

Key Observation in Dual Vdd Buffering

Disallowing VL => VH will not affect optimality

Optimality empirically illustrated (@ 65nm): (a) has LC and VH drives Cl; power (a) > (b); delay (b) > (a) only if Cl > 0.5pF (~ 9mm wire)

[Figure: buffer configurations (a) and (b). Labels: VH, VL]

DVB Formulation

Dual Vdd Buffer Insertion (DVB)

Given: an interconnect tree

Find: buffer placement, Vdd assignment for buffers, and sizes of buffers, with VH buffers driving VL buffers within the tree and level converters at VH sinks driven by VL buffers

Minimize power subject to: arrival time requirement (RAT) at the source; slew rate constraint at buffer inputs and sinks
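The Vdd rules in the formulation above (VH may drive VL, a VL buffer may not drive a VH buffer, and a VH sink driven from VL needs a level converter) can be checked mechanically. The sketch below is an illustration under assumed data structures: `drives`, `vdd` and `sinks` are hypothetical names, not part of the original work.

```python
VH, VL = "VH", "VL"

def check_vdd_assignment(drives, vdd, sinks):
    """Check a Vdd assignment against the DVB rules.

    drives: dict mapping each driver to the list of nodes it drives
            directly (buffers or sinks).
    vdd:    dict mapping each node to VH or VL.
    sinks:  set of sink nodes.
    Returns (legal, sinks_needing_level_converter).
    """
    legal = True
    needs_lc = []
    for driver, loads in drives.items():
        for load in loads:
            if vdd[driver] == VL and vdd[load] == VH:
                if load in sinks:
                    # VH sink driven by a VL buffer: allowed, with an LC.
                    needs_lc.append(load)
                else:
                    # VL buffer driving a VH buffer: disallowed.
                    legal = False
    return legal, sorted(needs_lc)

# Hypothetical 2-buffer, 2-sink tree: src -> b1 -> {b2 -> s2, s1}
drives = {"src": ["b1"], "b1": ["b2", "s1"], "b2": ["s2"]}
sinks = {"s1", "s2"}
good = {"src": VH, "b1": VH, "b2": VL, "s1": VH, "s2": VH}
bad = {"src": VH, "b1": VL, "b2": VH, "s1": VH, "s2": VH}
```

In the `good` assignment only `s2` needs a level converter; in the `bad` one, `b1` at VL drives `b2` at VH, which the formulation rules out.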

DVB Algorithm

Based on [Lillis, JSSC 96]

Dynamic programming with partial solution (option) pruning

Options must now record downstream Vdd levels for buffering, to prevent VL => VH; this also removes unnecessary search of the solution space

Still quite slow for large nets

Challenge: considering power causes super-linear growth in the number of options (w.r.t. tree size); dual Vdd buffers => 2x options at each node
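One way to picture how an option can carry its downstream Vdd level is sketched below. It is an interpretation, not the authors' code: the `Option` and `Buffer` fields, and the single `has_vh` flag standing in for "downstream Vdd levels", are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Option:
    c: float       # downstream capacitance seen at this node
    p: float       # accumulated power (energy per switch)
    q: float       # required arrival time at this node
    has_vh: bool   # True if a nearest downstream buffer runs at VH

@dataclass(frozen=True)
class Buffer:
    c_in: float    # input capacitance
    d_int: float   # intrinsic delay
    r_o: float     # output resistance
    energy: float  # lumped energy per switch
    is_vh: bool    # supply level of this buffer

def insert_buffer(opt: Option, buf: Buffer) -> Optional[Option]:
    """Return the option after inserting buf, or None if illegal."""
    if opt.has_vh and not buf.is_vh:
        return None  # a VL buffer would drive a VH buffer
    return Option(
        c=buf.c_in,
        p=opt.p + buf.energy,
        q=opt.q - (buf.d_int + buf.r_o * opt.c),
        has_vh=buf.is_vh,
    )

def merge(a: Option, b: Option) -> Option:
    """Combine the options of two branches meeting at a Steiner node."""
    return Option(a.c + b.c, a.p + b.p, min(a.q, b.q), a.has_vh or b.has_vh)
```

Rejecting the illegal combination at insertion time is what trims the search: a VL candidate is never generated above a VH buffer, so it never has to be pruned later.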

Speed-up Technique

Approximate by power-delay sampling

Sampling under each distinct cap value: uniformly pick options from the entire RAT-power trade-off curve
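A simple reading of the sampling step is sketched below: group options by capacitance, order each group along its power axis, and keep a fixed number of evenly spaced points including both ends of the curve. The function name, the tuple layout `(c, p, q)` and the index-based spacing are assumptions; the slides do not spell out the exact sampling rule.

```python
from collections import defaultdict

def sample_options(options, k):
    """Keep at most k options per distinct capacitance value.

    options: iterable of (c, p, q) tuples (cap, power, RAT).
    k:       samples per capacitance value, k >= 2.
    """
    if k < 2:
        raise ValueError("k must be at least 2 to keep both curve ends")
    by_cap = defaultdict(list)
    for c, p, q in options:
        by_cap[c].append((c, p, q))
    kept = []
    for c in sorted(by_cap):
        curve = sorted(by_cap[c], key=lambda o: (o[1], o[2]))
        if len(curve) <= k:
            kept.extend(curve)
            continue
        # Evenly spaced indices that always include the first and last
        # point, i.e. the lowest-power and highest-power options.
        last = len(curve) - 1
        idx = sorted({round(i * last / (k - 1)) for i in range(k)})
        kept.extend(curve[i] for i in idx)
    return kept
```

Keeping the end points means the highest-power option of each curve survives; on a non-dominated curve that is also the one with the best RAT, which fits with sampling leaving the optimal delay unchanged.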

Experimental Settings for DVB

Testcases: randomly generated Steiner trees, 20 to 800 terminals in a 1cm x 1cm routing area. Buffer sizes: 16x, 32x, 64x

Sampling grid set to 20x20

Comparison: exact power-optimal algorithm (PB) [Lillis, JSSC 96] vs. our algorithm with single (SVB) and dual (DVB) Vdd buffers

Sampling Preserves Optimality

Sampling has little impact on optimality: SVB follows PB closely, with still-optimal delay and 1.7% larger power than PB

Dual Vdd Reduces Power

Dual Vdd shifts power-delay curve to the left

Experimental Results for DVB

DVB saves 23% power over SVB. More power saving in larger nets. Power saving becomes larger with delay slack, e.g. relaxing delay by 5% raises the saving to 26%

Power at optimal RAT (fJ)

Net   # nodes   # sinks   SVB      DVB
S5    375       199       18699    13808 [-26%]
S6    515       299       23443    17239 [-26%]
S7    784       499       33552    23804 [-29%]
S8    1054      699       38351    25799 [-33%]
S9    1188      799       40228    26646 [-34%]
avg                                [-23%]

Runtime

SVB scales a lot better for larger testcases: achieved 17x speedup over PB [Lillis, JSSC 96]

DVB takes ~2.5x more runtime than SVB

Runtime (s)

Net   # nodes   # sinks   PB        SVB     DVB
S5    375       199       719       86      212
S6    515       299       2121      139     371
S7    784       499       33419     393     635
S8    1054      699       > 1 day   598     1072
S9    1188      799       > 1 day   853     1859
avg                       1x        1/17x   1/7x

Outline

Dual-Vdd buffer insertion and sizing (DVB): problem formulation, "sampling" speed-up technique, experimental results

Dual-Vdd buffered tree generation (D-Tree): problem formulation, improved augmented orthogonal search tree, experimental results

D-Tree Formulation

Dual Vdd Buffered Tree (D-Tree)

Given: locations of terminals, buffer stations and blockages

Find: a rectilinear Steiner tree (RST) and buffer placement/size/Vdd assignment, with VH buffers driving VL buffers only and level converters at VH sinks driven by VL buffers

Minimize power subject to: arrival time requirement (RAT) at the source; slew rate constraint at buffer inputs and sinks

D-Tree is NP-Hard: finding a minimum RST alone is NP-Complete

Buffered Tree Construction

Delay optimization only [Cong, DAC 00], by:

1. Build Hanan graph w/buffer insertion nodes according to locations of buffer stations

2. Path search on the grid by option propagation
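Step 1 above can be illustrated with a small Hanan grid builder. This is a sketch under simplifying assumptions: buffer stations are treated as points whose coordinates add grid lines, blockages are ignored, and the names `hanan_grid`, `terminals` and `stations` are hypothetical.

```python
def hanan_grid(terminals, stations):
    """Build a Hanan grid from terminal and buffer station points.

    terminals, stations: lists of (x, y) points.
    Returns (nodes, edges, buffer_nodes), where buffer_nodes is the
    set of grid nodes at which a buffer may be inserted.
    """
    points = list(terminals) + list(stations)
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    # Every intersection of a vertical and a horizontal line is a node.
    nodes = [(x, y) for x in xs for y in ys]
    edges = []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i + 1 < len(xs):
                edges.append(((x, y), (xs[i + 1], y)))  # horizontal edge
            if j + 1 < len(ys):
                edges.append(((x, y), (x, ys[j + 1])))  # vertical edge
    buffer_nodes = set(stations)
    return nodes, edges, buffer_nodes
```

The path search in step 2 would then propagate options along these edges, attempting buffer insertion only at nodes in `buffer_nodes`.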

D-Tree Algorithm Overview

Challenges: growth of options is exponential (an artifact of D-Tree's NP-hardness); considering power worsens option growth

Solution: sampling + efficient prune tree

Prune Tree in [Lillis, JSSC 96]

Options are inserted in sorted capacitance order, so options never need to be cleared out of the tree

If each new option is checked against the tree before insertion, redundant options in the tree are automatically avoided, e.g. Фnew = (c = 20, p = 100, q = 600)

Not applicable to the D-Tree problem: the order of new options is not known a priori

[Figure: prune tree for P = 100 with nodes (c=20, q=600), (c=10, q=500), (c=8, q=400), (c=15, q=550), (c=12, q=520), (c=7, q=380)]

Our Improvement on Prune Tree

Indexing w/capacitance results in fewer trees: # capacitance values < # power values

Efficient "tree cleaning": enables out-of-order option insertion and guarantees no redundancy in the tree

Tree Cleaning

To add an option Фnew in O(|c|·log(|T|)) time:

1. Check whether Фnew is dominated by any option in the data structure

2. If not, remove the options in the tree dominated by Фnew in two downward tree traversals, e.g. Фnew = (c = 10, p = 70, q = 410, …)
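The two steps above can be sketched with a simpler stand-in data structure. Instead of the augmented orthogonal search tree, this sketch keeps one pair of sorted Python lists per capacitance value and uses `bisect`; it shows the dominance check and the cleaning logic, but list deletion is linear, so it does not reach the O(|c|·log(|T|)) bound. It assumes Фa dominates Фb when Фa has no larger c, no larger p and no smaller q; the class and method names are hypothetical.

```python
import bisect

class OptionStore:
    """Non-redundant options indexed by capacitance (illustrative)."""

    def __init__(self):
        # c -> (ps, qs): power and RAT lists, both strictly ascending.
        self.curves = {}

    def _dominated(self, c, p, q):
        for cap, (ps, qs) in self.curves.items():
            if cap <= c:
                # Best RAT among stored options with power <= p.
                i = bisect.bisect_right(ps, p)
                if i > 0 and qs[i - 1] >= q:
                    return True
        return False

    def add(self, c, p, q):
        """Insert (c, p, q) in any order; return False if redundant."""
        if self._dominated(c, p, q):
            return False
        # Cleaning: drop stored options the new one dominates.
        for cap, (ps, qs) in self.curves.items():
            if cap >= c:
                lo = bisect.bisect_left(ps, p)   # first with power >= p
                hi = bisect.bisect_right(qs, q)  # past last with RAT <= q
                if lo < hi:
                    del ps[lo:hi]
                    del qs[lo:hi]
        ps, qs = self.curves.setdefault(c, ([], []))
        i = bisect.bisect_left(ps, p)
        ps.insert(i, p)
        qs.insert(i, q)
        return True

    def options(self):
        return sorted((c, p, q) for c, (ps, qs) in self.curves.items()
                      for p, q in zip(ps, qs))
```

Since every per-capacitance curve is kept non-dominated, power and RAT rise together along it, so the options dominated by a newcomer always form one contiguous slice and can be removed in a single deletion.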

Experimental Settings for D-Tree

Random testcases: all based on a random floorplan of 1cm x 1cm; blockages ~30%, buffer stations ~1mm apart

Comparison: delay-optimal tree (RMP) [Cong, DAC 00] vs. ours with single (S-Tree) and dual (D-Tree) Vdd buffers

Experimental Results for D-Tree

Significant power saving over RMP: S-Tree 7%, D-Tree 18%. Larger saving for large testcases (e.g. T4)

Handles up to 6-sink nets (T5 takes 23 mins): similar capability compared with delay-optimal approaches [Cong, DAC 00; Chen, ASP-DAC 02]

Power @ optimal RAT (pJ)

Net   # nodes   # sinks   RMP   S-Tree        D-Tree
T3    137       4         3.9   3.5 [-10%]    2.9 [-23%]
T4    261       5         4.9   4.4 [-13%]    3.1 [-37%]
T5    235       6         4.2   3.8 [-10%]    3.4 [-18%]
avg                             -7%           -18%

Conclusion

Formulated dual Vdd buffer insertion/tree generation without level converters

Proposed 2 speedup techniques: "sampling" w/negligible loss of optimality; "improved prune tree" for solution pruning

Applied to single-Vdd buffer insertion, 17x faster than existing work

Large power saving over single Vdd buffering: 23% in buffer insertion (dual Vdd vs single Vdd); 18% in buffered tree (dual Vdd vs delay optimal)

Future Work

Speed up tree construction

Slack allocation for more power reduction

Path-based buffer insertion [Sze, DAC 05]: allocates slack along one interconnect path; considers single Vdd buffers only

Chip-level FPGA dual Vdd assignment [Lin, DAC 05]: fixed buffer locations, assigns Vdd levels; considers multiple critical paths; solved as a linear programming problem
