Design Automation Group

Parallel Hierarchical Cross Entropy Optimization for On-Chip Decap Budgeting

Xueqian Zhao, Yonghe Guo, Zhuo Feng, Shiyan Hu

Department of Electrical & Computer Engineering
Michigan Technological University

X. Zhao et al., 47th DAC, June 17th, 2010. 2010 ACM/EDAC/IEEE Design Automation Conference

Outline

Introduction

Problem Formulation

The Cross Entropy Based Algorithm

• Importance Sampling Based
• Hierarchical Optimization
• Sensitivity Guided
• Parallelized in GPU environment

Experimental Results

Conclusion


Power Supply Network

Power supply grid is one of the most important sources of noise.

[Figure: power supply network; labels: Vdd, interconnect wire, current, node, functional gate]


Power Supply Noise

Supply voltage variation can result in supply noise, which can lead to problems related to logic errors, spurious transitions and delay variations.

[Figure: voltage drop; waveform V(t) at a node falls from Vdd below the threshold Vth between t1 and t2 within [0, T]; the area below Vth is the noise g_j]

$g_j(c_1, \ldots, c_m) = \int_0^T \max(V_{th} - v_j(t),\, 0)\, dt$

H. Su, S. Sapatnekar, and S. Nassif. Optimal decoupling capacitor sizing and placement for standard-cell layout designs. (IEEE Trans. on CAD, ’03)
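As a minimal sketch of how the noise metric could be evaluated from a simulated waveform, the snippet below integrates the clipped drop max(V_th − v(t), 0) with the trapezoidal rule. The function name `supply_noise`, the sample waveform and the units are illustrative assumptions, not part of the original slides.

```python
def supply_noise(times, voltages, v_th):
    """Approximate g = integral over [0, T] of max(v_th - v(t), 0) dt.

    Uses the trapezoidal rule on the clipped waveform; `times` must be
    increasing and have the same length as `voltages`.
    """
    if len(times) != len(voltages):
        raise ValueError("times and voltages must have the same length")
    drops = [max(v_th - v, 0.0) for v in voltages]
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        total += 0.5 * (drops[k] + drops[k - 1]) * dt
    return total


# Hypothetical waveform: the node dips below v_th = 0.9 V between 1 ns and 3 ns.
t = [0.0, 1.0, 2.0, 3.0, 4.0]          # ns
v = [1.0, 0.9, 0.8, 0.9, 1.0]          # V
noise = supply_noise(t, v, v_th=0.9)   # triangle of height 0.1 V, base 2 ns
```

Only the part of the waveform below the threshold contributes, so a node that never violates V_th has zero noise.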

Power Grid Transient Analysis

Using power grid transient analysis to identify the power supply noise.
In simulation, gates are replaced by pulse current sources.

$C \frac{dv(t)}{dt} + G v(t) = b(t)$

[Figure: equivalent power grid model fed from Vdd, with gates replaced by pulse current sources Ids(t)]
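The transient equation C dv/dt + G v = b(t) can be stepped in time with an implicit scheme. The sketch below applies backward Euler to a hypothetical two-node grid (node 1 tied to Vdd through a conductance, node 2 loaded by a pulse current source). The circuit values, function names and the choice of backward Euler are assumptions for illustration; the slides do not state which integration method is used.

```python
def solve_2x2(a, rhs):
    """Solve a 2x2 linear system a @ x = rhs by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (rhs[0] * a[1][1] - a[0][1] * rhs[1]) / det
    x1 = (a[0][0] * rhs[1] - rhs[0] * a[1][0]) / det
    return [x0, x1]


def transient(c, g, b_of_t, v0, h, steps):
    """Backward Euler for C dv/dt + G v = b(t) with diagonal C (list c)."""
    # The system matrix (C/h + G) is constant, so it is formed once.
    a = [[g[0][0] + c[0] / h, g[0][1]],
         [g[1][0], g[1][1] + c[1] / h]]
    v, history = list(v0), [list(v0)]
    for k in range(1, steps + 1):
        b = b_of_t(k * h)
        rhs = [c[0] / h * v[0] + b[0], c[1] / h * v[1] + b[1]]
        v = solve_2x2(a, rhs)
        history.append(v)
    return history


# Hypothetical two-node grid in arbitrary consistent units.
VDD, G_S, G_12 = 1.0, 10.0, 10.0
g = [[G_S + G_12, -G_12], [-G_12, G_12]]
c = [0.5, 0.5]


def b_of_t(t):
    i_load = 1.0 if 1.0 <= t < 2.0 else 0.0   # pulse current drawn at node 2
    return [G_S * VDD, -i_load]


history = transient(c, g, b_of_t, v0=[VDD, VDD], h=0.01, steps=400)
lowest_v2 = min(v[1] for v in history)        # worst-case voltage at node 2
```

In a realistic grid the 2x2 solve would be replaced by a sparse linear solver, which is the part the slides later offload to the GPU.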


Decoupling Capacitor (Decap)

Decap insertion and effect on the supply voltage variation

Current is partially supplied by decap.

[Figure: supply voltage waveforms before applying decap and after applying decap]

Budget Constrained Decap Optimization

Our objective is to minimize the total noise subject to the global and local constraints.

$\min: \sum_{j=1}^{n} g_j(c_1, \ldots, c_m)$ (m candidate decap locations/nodes)

$\text{s.t.}\quad c_i \le C_{i,u}$ (local size constraint)

$\sum_{i=1}^{m} c_i \le C_{tot}$ (global budget constraint)

Constraints: limited empty space in the chip; leakage power; impact on routing of interconnect wires, etc.

H. Su, S. Sapatnekar, and S. Nassif. Optimal decoupling capacitor sizing and placement for standard-cell layout designs, IEEE Trans. on CAD, ’03
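The two constraints can be checked mechanically for any candidate budget vector. The sketch below is illustrative only: the function names are hypothetical, non-negativity of each c_i is assumed on physical grounds, and the clip-then-scale repair is one simple heuristic, not a step described in the slides.

```python
def is_feasible(c, c_upper, c_total, tol=1e-12):
    """Check the local bounds 0 <= c_i <= c_upper[i] and the global budget."""
    local_ok = all(0.0 <= ci <= ui + tol for ci, ui in zip(c, c_upper))
    global_ok = sum(c) <= c_total + tol
    return local_ok and global_ok


def clip_and_scale(c, c_upper, c_total):
    """Repair a candidate: clip to local bounds, then shrink to the budget.

    Uniform down-scaling keeps every entry within its already enforced bound.
    """
    clipped = [min(max(ci, 0.0), ui) for ci, ui in zip(c, c_upper)]
    used = sum(clipped)
    if used > c_total:
        scale = c_total / used
        clipped = [ci * scale for ci in clipped]
    return clipped


# Hypothetical example: two nodes, each limited to 2 units, total budget 2.
repaired = clip_and_scale([3.0, 3.0], [2.0, 2.0], 2.0)
```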

Motivation

Sensitivity-guided Cross Entropy Based Optimization (SCE)
– Relative sensitivity
– Importance sampling
– Easy to parallelize

Hierarchical Optimization
– Different strategies for block-level and node-level decap budgeting

Parallel Acceleration
– GPU acceleration for power grid simulation
– Parallel sample evaluation on multi-core/many-core platforms

Decap Sensitivity

• More efficient rule for decap budgeting.

• Decap sensitivity:

$s_{i,all} = \frac{\partial \sum_{j=1}^{n} g_j(c_1, \ldots, c_m)}{\partial c_i}$

The above formula cannot be directly used for sensitivity computation:
1. m candidate nodes need m transient analyses (time consuming)
2. $\partial c_i$ is difficult to determine


Efficient Sensitivity Computation

Adjoint sensitivity computation: needs only one original network transient analysis and one adjoint network transient analysis.

The two networks have the same topology but different source setups.

[Figure: original network (supplied from Vdd) and adjoint network (supply tied to Gnd), each with the violating node marked]


Efficient Sensitivity Computation (cont.)

Adjoint sensitivity computation: convolution of the two voltage waveforms obtained from each network.

Original $\Rightarrow v_i(t)$

Adjoint $\Rightarrow V^*_{i,all}(t)$

$s_{i,all} = \int_0^T V^*_{i,all}(T - t)\, v_i(t)\, dt$

L. Pillage, R. Rohrer and C. Visweswariah, Electronic circuit & system simulation methods, McGraw-Hill, 1995.
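Numerically, the convolution as written on the slide reduces to pairing the original waveform with the time-reversed adjoint waveform and integrating the product. The sketch below assumes both waveforms are sampled on the same uniform grid t = 0, h, ..., T; the function name and sample values are hypothetical.

```python
def adjoint_sensitivity(v_original, v_adjoint, h):
    """Approximate s = integral_0^T V*(T - t) * v(t) dt on a uniform grid.

    Both waveforms are sampled at t = 0, h, ..., T; the adjoint waveform is
    read backwards in time and the trapezoidal rule is applied.
    """
    if len(v_original) != len(v_adjoint):
        raise ValueError("waveforms must have the same number of samples")
    n = len(v_original)
    prod = [v_adjoint[n - 1 - k] * v_original[k] for k in range(n)]
    return h * (sum(prod) - 0.5 * (prod[0] + prod[-1]))


# Hypothetical check: v(t) = 1 and V*(t) = t on [0, 2], so the integrand is
# (2 - t) and the integral equals 2.
s = adjoint_sensitivity([1.0, 1.0, 1.0], [0.0, 1.0, 2.0], h=1.0)
```

One pass over the two stored waveforms yields the sensitivity at a node, which is why only two transient analyses are needed regardless of the number of candidate nodes.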

Partitioning

• Reduce the solution space from a great number of candidate nodes to a smaller number of candidate blocks.
• Foundation of hierarchical optimization at block level and node level

[Figure: power grid partitioned into candidate blocks, each containing candidate nodes]


Main Idea

Hierarchical Optimization – Different strategies for block-level and node-level optimization

[Figure: decap assigned at block n. Block level: cross entropy based block-level decap budgeting. Node level: relative sensitivity based node-level decap budgeting]


Node-level Relative-sensitivity based Optimization

• Relative sensitivity based optimization
• Relative sensitivity is approximately constant within a small block
• No need to re-evaluate the sensitivity after each iteration

$\frac{s_a}{s_b} \approx \text{constant}$

The relative impact on noise reduction between nearby decaps stays the same before and after decap budgeting.
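Because the ratios s_a / s_b are treated as fixed, a block's budget can be split among its nodes once, without re-simulating. The sketch below assumes a proportional split by sensitivity magnitude with per-node caps; the slides do not spell out the exact allocation rule, so the rule, the function name and the numbers are assumptions.

```python
def distribute_block_budget(block_budget, sensitivities, upper_bounds):
    """Split a block's decap budget among its nodes in proportion to the
    (assumed fixed) sensitivity magnitudes, capping each node at its upper
    bound and re-spreading any excess over the nodes that still have room."""
    n = len(sensitivities)
    alloc = [0.0] * n
    active = [i for i in range(n) if sensitivities[i] > 0.0]
    remaining = min(block_budget, sum(upper_bounds[i] for i in active))
    while active and remaining > 1e-12:
        total_s = sum(sensitivities[i] for i in active)
        share = {i: remaining * sensitivities[i] / total_s for i in active}
        capped = [i for i in active if alloc[i] + share[i] >= upper_bounds[i]]
        if not capped:
            # Nobody hits a bound: hand out the proportional shares and stop.
            for i in active:
                alloc[i] += share[i]
            break
        for i in capped:
            remaining -= upper_bounds[i] - alloc[i]
            alloc[i] = upper_bounds[i]
        active = [i for i in active if i not in capped]
    return alloc


# Hypothetical block with three nodes; the third node is capped at 2 units.
alloc = distribute_block_budget(6.0, [1.0, 2.0, 3.0], [10.0, 10.0, 2.0])
```

In this example the capped node receives 2 units and the remaining 4 units are shared 1:2 between the other two nodes.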


Empirical Validation

The figure shows the relative sensitivities before and after decap insertion within a block of size 30 x 30.


Block-level Cross Entropy based Optimization

Cross Entropy Method (CE)

– A general Monte Carlo approach using the importance sampling technique

– Rare event probability estimation

– In any optimization problem, the optimum solution can be considered as a rare event

$\delta(a) = P[f(X) \le a] = E[I_{f(X) \le a}]$

f(x): the objective function
g(x): the PDF for the general Monte Carlo method
X: a set of samples generated from g(x)
a: the threshold

minimize $a$ s.t. $\delta(a) \to 0$


Importance Sampling

General Monte Carlo: g(x) needs a lot of randomly generated samples, but would still not obtain an accurate result (no sample may fall into the rare event region).

Importance sampling is used to reduce the number of samples.

Use a different PDF k(x) instead of g(x) to estimate δ(a) as β(a). Most of the samples generated by k(x) will fall into the rare event region. Thus, only a few samples are needed.

$\beta(a) = \frac{1}{n} \sum_{i=1}^{n} I_{f(X_i) \le a}\, \frac{g(X_i)}{k(X_i)}$

$\delta(a) = \beta(a) \;\Leftrightarrow\; k^*(x) = \frac{I_{f(x) \le a}\, g(x)}{\delta(a)}$
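The estimator β(a) can be tried on a toy rare event where the exact answer is known. The sketch below assumes g(x) is a standard normal, f(x) = x, and k(x) is a unit-variance normal centred at the threshold a; these choices and the function names are illustrative and not taken from the slides.

```python
import math
import random


def rare_event_probability(a, n, seed=0):
    """Estimate delta(a) = P[X <= a] for X ~ N(0, 1) by importance sampling.

    Samples come from k(x) = N(a, 1), which is centred on the rare region,
    and each hit is weighted by the likelihood ratio g(x) / k(x).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(a, 1.0)                 # draw from k(x)
        if x <= a:                            # indicator I{f(X) <= a}, f(x) = x
            # g(x) / k(x) = exp(-x^2 / 2) / exp(-(x - a)^2 / 2)
            total += math.exp(0.5 * (x - a) ** 2 - 0.5 * x ** 2)
    return total / n


def exact_probability(a):
    """Standard normal CDF at a, used only as a reference value."""
    return 0.5 * math.erfc(-a / math.sqrt(2.0))


estimate = rare_event_probability(a=-4.0, n=20000)
reference = exact_probability(-4.0)
```

With plain Monte Carlo from g(x), 20000 samples would rarely produce even one hit below a = −4, whereas about half of the samples drawn from k(x) land in the rare region.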

CE Based Decap Insertion

CE consists of two phases in a nutshell
– Generate a series of random data samples according to an initial specified PDF
– Update the parameters of the PDF, such as E(x) and σ²(x), based on the previous "good" samples to produce "better" samples in the next iteration.

[Figure: PDF k(x) over the solution space (x1, x2) and updated PDF k’(x) moving toward x*. k(x): PDF in solution space; x*: optimal solution]

CE Algorithmic Flow (for 2-block variables)

[Figure: samples plotted over decap budget at block 1 (horizontal axis) and decap budget at block 2 (vertical axis)]

Iteration 1: generate samples from k(x); pick the top solutions with smallest noise to update the PDF.

Iteration 2: generate another group of samples from the updated k(x).

Repeat until convergence: k*(x) concentrates at the optimum x*.
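The sample, select and update loop can be sketched on a toy two-variable problem. The code below assumes independent Gaussian PDFs per variable and a quadratic stand-in for the noise function; the actual PDF family, sample counts and stopping rule used by the authors are not given in the slides, so every name and parameter here is hypothetical.

```python
import random
import statistics


def cross_entropy_minimize(objective, mean, std, lower, upper,
                           n_samples=100, n_elite=10, iterations=30, seed=0):
    """Minimise `objective` with a Gaussian cross entropy loop.

    Each iteration draws samples from independent normals, keeps the
    `n_elite` samples with the smallest objective, and refits the mean and
    standard deviation of every variable to those elite samples.
    """
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    for _ in range(iterations):
        samples = []
        for _ in range(n_samples):
            # Draw one candidate and clamp it to the allowed budget range.
            x = [min(max(rng.gauss(m, s), lower), upper)
                 for m, s in zip(mean, std)]
            samples.append(x)
        samples.sort(key=objective)
        elite = samples[:n_elite]
        for d in range(len(mean)):
            column = [x[d] for x in elite]
            mean[d] = statistics.fmean(column)
            std[d] = statistics.pstdev(column)
        if max(std) < 1e-6:          # PDF has collapsed: treat as converged
            break
    return mean


def toy_noise(x):
    """Toy stand-in for the noise of a 2-block budget (not a grid model)."""
    return (x[0] - 3.0) ** 2 + (x[1] - 1.0) ** 2


best = cross_entropy_minimize(toy_noise, mean=[5.0, 5.0], std=[3.0, 3.0],
                              lower=0.0, upper=10.0)
```

In the decap setting each call to the objective corresponds to a full power grid simulation, which is why the sample evaluation step dominates the runtime.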

Parallel Decap Budgeting

[Figure: samples drawn from g(x) over decap budget at block 1 and decap budget at block 2, divided between Core 1 and Core 2]

Evaluate the noise of each solution: this is the bottleneck of computation, but it can be easily parallelized.


Multi/Many-Core Based Parallelization

The graph shows the flow of multi-threaded SCE sample processing with multiple GPUs.

Generate n samples
→ Thread 1 ... Thread k, each taking n/k samples
→ n/k samples processed on GPU 1 ... n/k samples processed on GPU k
→ Pick top best ones

Z. Feng and P. Li. Multigrid on GPU: tackling power grid analysis on parallel SIMT platforms, ICCAD’08.
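The split, evaluate and select flow above can be mimicked on a CPU with a thread pool. In the sketch below each worker stands in for one thread driving one GPU; the function names and the toy noise function are hypothetical, and plain Python threads will not speed up CPU-bound work, so this shows only the structure of the flow.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_chunk(chunk, noise_fn):
    """Evaluate every sample in one chunk (stands in for one GPU's work)."""
    return [(noise_fn(sample), sample) for sample in chunk]


def parallel_top_samples(samples, noise_fn, k_workers, n_top):
    """Split samples into k chunks, evaluate the chunks concurrently, and
    return the n_top samples with the smallest noise."""
    chunks = [samples[i::k_workers] for i in range(k_workers)]
    with ThreadPoolExecutor(max_workers=k_workers) as pool:
        results = list(pool.map(evaluate_chunk, chunks,
                                [noise_fn] * k_workers))
    scored = [pair for chunk_result in results for pair in chunk_result]
    scored.sort(key=lambda pair: pair[0])
    return [sample for _, sample in scored[:n_top]]


# Hypothetical usage: four one-variable samples, two workers, keep the best two.
top = parallel_top_samples([[3.0], [1.0], [2.0], [0.0]],
                           noise_fn=lambda s: s[0], k_workers=2, n_top=2)
```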

Complete Sensitivity-guided CE (SCE) Algorithm Flow

1. Power grid partition & sensitivity calculation
2. Build up a PDF for solutions
3. Generate block-level decap budgeting samples using the PDF
4. Determine decap size for each node based on relative sensitivity
5. Evaluate solutions on multi-core multi-GPU
6. Result comparison: if not converged, iterate again; otherwise stop


Experimental Setup

Hardware Platform Setup

– IBM Power Grid Benchmarks (S. Nassif, ASPDAC ’08)

– C++ / GPU CUDA

– Intel Quad-Core CPU, 2.66 GHz

– Two NVIDIA GeForce GTX285 graphics cards

– Ubuntu 8.04, 64-bit

Compare to a recent conjugate gradient based decap optimization approach (iCG)

– H. Li, J. Fan, Z. Qi, S. Tan, L. Wu, Y. Cai and X. Hong, Partitioning-Based Approach to Fast On-Chip Decoupling Capacitor Budgeting and Minimization. (TCAD ’06).

Comparison - I

The figure shows total noise after decap insertion under different budgets and methods.


Noise-Decap Budget Tradeoff

Using our SCE method, a 70% decap budget can eliminate most of the power supply noise.


Comparison – II-1

The table compares runtime, total noise and number of iterations among different methods (SCE columns are the partition-based SCE with block dimensions 10x10 and 25x25).

Budget 50%      iCG                 CE                  SCE, block 10x10    SCE, block 25x25
CKT   #vio.N    N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)
ibm2     481    19.7    15     62   35.1    20    316   14.8     3     38   15.8     2     25
ibm4   1,829    24.2    15    638     --    --     --   19.2     4    401   20.3     3    300
ibm5   1,809    47.2    15   1265     --    --     --   38.1     4   1026   42.1     3    729
ibm6   1,926    30.1    15   1409     --    --     --   27.7     5   1258   28.1     3    771


Comparison – II-2

Budget 70%      iCG                 CE                  SCE, block 10x10    SCE, block 25x25
CKT   #vio.N    N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)
ibm2     481    1.83    13     55   12.4    20  312.1   1.77     3     38   1.78     2     25
ibm4   1,829     7.1    14    592     --    --     --    1.7     3    307    3.7     2    203
ibm5   1,809    38.3    15   1286     --    --     --   24.1     4   1028   23.8     3    735
ibm6   1,926     6.4    15   1430     --    --     --    5.1     5   1219    6.0     3    769

Comparison – II-3

Budget 90%      iCG                 CE                  SCE, block 10x10    SCE, block 25x25
CKT   #vio.N    N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)
ibm2     481    0.02    16     65   0.02    19    294  0.004     5     63   0.01     3     37
ibm4   1,829    0.00    16    617     --    --     --   0.00     3    299   0.00     4    398
ibm5   1,809    31.2    17   1459     --    --     --    7.1     5   1251    8.4     3   1119
ibm6   1,926    0.00     1    151     --    --     --   0.00     1    354   0.00     1    356


Speedup Between Different Setups

The figure shows the comparison of time cost between decap simulation with a single GPU and with two GPUs.


Conclusion

■ A novel cross entropy based optimization technique is proposed for the decoupling capacitor budgeting problem.
  ■ Sensitivity guided
  ■ Hierarchical optimization
  ■ Parallelization-friendly for multi-/many-core platforms

■ Experimental results demonstrate that our algorithm runs 2x faster than the prior approach and obtains 25% better results in the final decap budgeting solutions.


Thanks!

