Design Automation Group
Parallel Hierarchical Cross Entropy Optimization for On-Chip Decap Budgeting
Xueqian Zhao, Yonghe Guo, Zhuo Feng, Shiyan Hu
Department of Electrical & Computer Engineering, Michigan Technological University
X. Zhao et al., 47th DAC, June 17th, 2010 (2010 ACM/EDAC/IEEE Design Automation Conference)
Outline
Introduction
Problem Formulation
The Cross Entropy Based Algorithm
• Importance Sampling Based
• Hierarchical Optimization
• Sensitivity Guided
• Parallelized in GPU environment
Experimental Results
Conclusion
Power Supply Network
The power supply grid is one of the most important sources of noise.
[Figure: power supply network; labels: Vdd, interconnect wire, current, node, functional gate]
Voltage Drop
Power Supply Noise
Supply voltage variation can result in supply noise, which can lead to problems related to logic errors, spurious transitions and delay variations.
[Figure: supply voltage waveform V(t) against time t, with levels Vdd and Vth, time marks 0, t1, t2, T, and the noise g_j]
$$ g_j(c_1, \ldots, c_m) = \int_0^T \max\big(V_{th} - v_j(t),\, 0\big)\, dt $$
H. Su, S. Sapatnekar, and S. Nassif, Optimal decoupling capacitor sizing and placement for standard-cell layout designs, IEEE Trans. on CAD, '03.
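As a rough illustration of the noise metric above, the sketch below approximates the integral of max(Vth − v_j(t), 0) with the trapezoidal rule on a sampled waveform. The function name `noise_metric` and the sample values are assumptions made for this example; they do not come from the slides.

```python
def noise_metric(times, voltages, v_th):
    """Approximate g_j = integral over [0, T] of max(V_th - v_j(t), 0) dt.

    times and voltages are equally long sequences of sample points.
    The trapezoidal rule is an illustrative choice, not the authors' method.
    """
    if len(times) != len(voltages):
        raise ValueError("times and voltages must have the same length")
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        # Only the part of the waveform below the threshold counts as noise.
        left = max(v_th - voltages[k - 1], 0.0)
        right = max(v_th - voltages[k], 0.0)
        total += 0.5 * (left + right) * dt
    return total


# Hypothetical waveform that dips below V_th = 0.8 around t = 2.
sample_times = [0.0, 1.0, 2.0, 3.0, 4.0]
sample_voltages = [1.0, 0.8, 0.6, 0.8, 1.0]
sample_noise = noise_metric(sample_times, sample_voltages, 0.8)
```

A waveform that never drops below Vth gives a metric of zero, so nodes without violations do not contribute to the total noise.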
Equivalent Power Grid Model
Power Grid Transient Analysis
Power grid transient analysis is used to identify the power supply noise. In simulation, gates are replaced by pulse current sources.
$$ C \frac{dv(t)}{dt} + G\, v(t) = b(t) $$
[Figure: equivalent power grid model supplied by Vdd, with a pulse current waveform Ids against time t]
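One common way to integrate C dv/dt + G v = b(t) numerically is backward Euler, which solves (C/h + G) v_{k+1} = b(t_{k+1}) + (C/h) v_k at each step. The sketch below applies it to a hypothetical two-node grid with a pulse current load. The integration scheme, the tiny dense solver, and all element values are assumptions for illustration; the slides do not say which scheme the authors use.

```python
def solve_dense(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def backward_euler(cap, g, b_of_t, v0, h, steps):
    """Integrate C dv/dt + G v = b(t) with a fixed step h; return all states."""
    n = len(v0)
    # The left-hand matrix (C/h + G) does not change between steps.
    lhs = [[cap[i][j] / h + g[i][j] for j in range(n)] for i in range(n)]
    v = list(v0)
    history = [list(v)]
    for k in range(1, steps + 1):
        b = b_of_t(k * h)
        rhs = [b[i] + sum(cap[i][j] * v[j] for j in range(n)) / h
               for i in range(n)]
        v = solve_dense(lhs, rhs)
        history.append(list(v))
    return history


# Hypothetical grid: Vdd -- 1 ohm -- node 0 -- 1 ohm -- node 1 (load).
VDD = 1.0
G = [[2.0, -1.0], [-1.0, 1.0]]
C = [[1.0, 0.0], [0.0, 1.0]]


def pulse_load(t):
    """Right-hand side b(t): Vdd injection at node 0, pulse current at node 1."""
    i_ds = 0.1 if 1.0 <= t < 2.0 else 0.0
    return [VDD * 1.0, -i_ds]


waveforms = backward_euler(C, G, pulse_load, [VDD, VDD], h=0.01, steps=400)
```

Because the left-hand matrix is constant, a production solver would factor it once and reuse the factors at every step; the sketch re-solves each time only to stay short.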
Decoupling Capacitor (Decap)
Decap insertion and its effect on the supply voltage variation
Current is partially supplied by the decap.
[Figure: supply voltage waveforms before applying decap and after applying decap]
Budget Constrained Decap Optimization
Our objective is to minimize the total noise subject to the global and local constraints.
$$ \min \sum_{j=1}^{n} g_j(c_1, \ldots, c_m) \qquad \text{(m candidate decap locations/nodes)} $$
$$ \text{s.t.} \quad c_i \le C_u \qquad \text{(local size constraint)} $$
$$ \sum_{i=1}^{m} c_i \le C_{tot} \qquad \text{(global budget constraint)} $$
Constraints: limited empty space in the chip; leakage power; impact on the routing of interconnect wires, etc.
H. Su, S. Sapatnekar, and S. Nassif, Optimal decoupling capacitor sizing and placement for standard-cell layout designs, IEEE Trans. on CAD, '03.
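The formulation above can be read as an objective plus a feasibility test on a candidate decap vector. The sketch below encodes both, assuming each g_j is available as a callable and that decap sizes are non-negative (the lower bound is an assumption of this example, not stated on the slide). All names are hypothetical.

```python
def total_noise(decaps, noise_functions):
    """Objective: sum of g_j(c_1, ..., c_m) over all noise functions g_j."""
    return sum(g(decaps) for g in noise_functions)


def is_feasible(decaps, c_u, c_tot, tol=1e-12):
    """Check the local size constraint and the global budget constraint.

    Assumes 0 <= c_i as an implicit lower bound (not stated on the slide).
    """
    local_ok = all(0.0 <= c <= c_u + tol for c in decaps)
    global_ok = sum(decaps) <= c_tot + tol
    return local_ok and global_ok


# Hypothetical noise functions: more decap near a node lowers its noise.
example_noise = [
    lambda c: max(1.0 - c[0], 0.0),
    lambda c: max(2.0 - c[0] - c[1], 0.0),
]
example_value = total_noise([0.5, 1.0], example_noise)
```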
Motivation
Sensitivity-guided Cross Entropy Based Optimization (SCE)
– Relative sensitivity
– Importance sampling
– Easy to parallelize
Hierarchical Optimization
– Different strategies for block-level and node-level decap budgeting
Parallel Acceleration
– GPU acceleration for power grid simulation
– Parallel sample evaluation on multi-core/many-core platforms
Decap Sensitivity
• More efficient rule for decap budgeting.
• Decap Sensitivity:
$$ s_{i,all} = \frac{\partial \sum_{j=1}^{n} g_j(c_1, \ldots, c_m)}{\partial c_i} $$
The above formula cannot be used directly for sensitivity computation:
1. m candidate nodes need m transient analyses (time consuming)
2. ∂c_i is difficult to determine
Efficient Sensitivity Computation
Adjoint sensitivity computation: needs only one original network transient analysis and one adjoint network transient analysis.
The two networks have the same topology but different source setups.
[Figure: original network (label Vdd) and adjoint network (label Gnd), each marking the violating node]
Efficient Sensitivity Computation (cont.)
Adjoint sensitivity computation: convolution of the two voltage waveforms obtained from each network.
Original ⇒ $v_i(t)$
Adjoint ⇒ $V^*_{i,all}(t)$
$$ s_{i,all} = \int_0^T V^*_{i,all}(T - t)\, v_i(t)\, dt $$
L. Pillage, R. Rohrer and C. Visweswariah, Electronic circuit & system simulation methods, McGraw-Hill, 1995.
Partitioning
• Reduces the solution space from a large number of candidate nodes to a smaller number of candidate blocks.
• Foundation of hierarchical optimization at the block level and the node level.
[Figure: power grid partitioned into candidate blocks, each containing candidate nodes]
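To make the block idea concrete, the sketch below groups candidate nodes on a regular grid into square blocks by integer division of their coordinates. The slides do not describe the partitioning rule, so the uniform square partition and the name `partition_nodes` are assumptions for illustration.

```python
def partition_nodes(nodes, block_dim):
    """Group (x, y) grid nodes into square blocks of block_dim x block_dim.

    Returns a dict mapping a block index (bx, by) to its list of nodes.
    """
    blocks = {}
    for x, y in nodes:
        key = (x // block_dim, y // block_dim)
        blocks.setdefault(key, []).append((x, y))
    return blocks


# Hypothetical 4 x 4 grid of candidate nodes split into 2 x 2 blocks.
grid_nodes = [(x, y) for x in range(4) for y in range(4)]
grid_blocks = partition_nodes(grid_nodes, 2)
```

With this layout the optimizer works on one budget variable per block instead of one per node, which is the reduction in solution space the slide refers to.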
Main Idea
Hierarchical Optimization – Different strategies for block-level and node-level optimization
Cross Entropy based Block-Level Decap Budgeting
Relative sensitivity based Node-Level Decap Budgeting
[Figure: decap assigned at block n]
Node-Level Decap Budgeting
Node-level relative-sensitivity based optimization
• Relative sensitivity is approximately constant within a small block
• No need to re-evaluate the sensitivity after each iteration
$$ \frac{s_a}{s_b} \approx \text{constant} $$
The relative impact on noise reduction between nearby decaps stays the same before and after decap budgeting.
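If relative sensitivities within a block stay roughly constant, one plausible reading of node-level budgeting is to split a block's decap budget among its nodes in proportion to their sensitivities. The sketch below does that and additionally caps each node at the local limit C_u, redistributing any excess. The proportional rule with capping is an assumption of this example; the slides do not spell out the exact allocation procedure.

```python
def distribute_block_budget(block_budget, sensitivities, c_u):
    """Split block_budget among nodes in proportion to their sensitivities.

    Nodes whose proportional share exceeds c_u are fixed at c_u and the
    remaining budget is shared again among the other nodes.
    """
    n = len(sensitivities)
    alloc = [0.0] * n
    active = [i for i in range(n) if sensitivities[i] > 0.0]
    # A block cannot absorb more than c_u per active node.
    remaining = min(block_budget, c_u * len(active))
    while active and remaining > 0.0:
        total_s = sum(sensitivities[i] for i in active)
        shares = {i: remaining * sensitivities[i] / total_s for i in active}
        capped = [i for i in active if shares[i] > c_u]
        if not capped:
            for i in active:
                alloc[i] = shares[i]
            break
        for i in capped:
            alloc[i] = c_u
            remaining -= c_u
        active = [i for i in active if i not in capped]
    return alloc


# Hypothetical block: node 0 is three times as sensitive as nodes 1 and 2.
example_alloc = distribute_block_budget(10.0, [3.0, 1.0, 1.0], 4.0)
```

Since the sensitivity ratios are treated as fixed, this allocation needs no extra transient analysis per sample, which is the point made on the slide.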
Empirical Validation
The figure shows the relative sensitivities before and after decap insertion within a block of size 30 x 30.
Block-level Cross Entropy based Optimization
Cross Entropy Method (CE)
– A general Monte Carlo approach using the importance sampling technique
– Rare event probability estimation
– In any optimization problem, the optimum solution can be considered as a rare event
$$ \delta(a) = P[f(X) \le a] = E\big[I_{f(X) \le a}\big] $$
f(x): the objective function
g(x): the PDF for the general Monte Carlo method
X: a set of samples generated from g(x)
a: the threshold
minimize a  s.t.  δ(a) → 0
Importance Sampling
General Monte Carlo: sampling from g(x) needs a large number of randomly generated samples and still may not give an accurate result (no sample may fall into the rare event region).
Importance sampling is used to reduce the number of samples.
Use a different PDF k(x) instead of g(x) to estimate δ(a) as β(a). Most of the samples generated by k(x) will fall into the rare event region, so only a few samples are needed.
$$ \beta(a) = \frac{1}{n} \sum_{i=1}^{n} I_{f(X_i) \le a}\, \frac{g(X_i)}{k(X_i)} $$
$$ \delta(a) = \beta(a) \;\Leftrightarrow\; k^*(x) = \frac{I_{f(x) \le a}\, g(x)}{\delta(a)} $$
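The estimator β(a) can be tried on a toy rare event. In the sketch below, f(x) = x, g(x) is the standard normal PDF, the threshold is a = −4, and the sampling PDF k(x) is a normal centered at a. These choices, and all function names, are assumptions for illustration and have nothing to do with decap data. Crude Monte Carlo with the same sample count usually sees no hit at all, while the weighted estimate lands close to the exact tail probability.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def exact_tail(a):
    """Exact P[X <= a] for a standard normal X."""
    return 0.5 * math.erfc(-a / math.sqrt(2.0))


def crude_estimate(a, n, rng):
    """Plain Monte Carlo: fraction of samples from g(x) with x <= a."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) <= a)
    return hits / n


def importance_estimate(a, n, rng):
    """beta(a): sample from k(x) = N(a, 1) and weight each hit by g/k."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(a, 1.0)
        if x <= a:
            total += normal_pdf(x) / normal_pdf(x, mean=a)
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    a = -4.0
    print("exact     :", exact_tail(a))
    print("crude MC  :", crude_estimate(a, 20000, rng))
    print("importance:", importance_estimate(a, 20000, rng))
```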
CE Based Decap Insertion
In a nutshell, CE consists of two phases:
– Generate a series of random data samples according to an initially specified PDF
– Update the parameters of the PDF (E(x), σ²(x), etc.) based on the previous "good" samples to produce "better" samples in the next iteration.
[Figure: k(x), the PDF in the solution space (axes x1, x2), the updated PDF k'(x), and the optimal solution x*]
CE Algorithmic Flow (for 2 block variables)
Iteration 1: generate samples from k(x), then pick the top solutions with the smallest noise to update the PDF.
Iteration 2: generate another group of samples from the updated k(x).
Repeat until convergence to k*(x) at the optimum x*.
[Figure: samples in the plane of decap budget at block 1 versus decap budget at block 2, shown for iteration 1, iteration 2 and the converged PDF]
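The sample, select, update loop described on this slide can be sketched with an independent Gaussian PDF per block variable. In the code below the quadratic objective is a stand-in for the total noise, and the sample count, elite count and stopping tolerance are arbitrary; all of these are assumptions for illustration rather than the authors' settings.

```python
import random
import statistics


def cross_entropy_minimize(objective, mean, std, n_samples=200, n_elite=20,
                           iters=30, rng=None):
    """Minimize objective with a basic cross entropy loop.

    mean and std parameterize one independent Gaussian per variable.
    """
    rng = rng or random.Random(0)
    mean, std = list(mean), list(std)
    for _ in range(iters):
        # Phase 1: draw samples from the current PDF k(x).
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Phase 2: keep the samples with the smallest objective value
        # and refit the mean and standard deviation to them.
        samples.sort(key=objective)
        elite = samples[:n_elite]
        for d in range(len(mean)):
            column = [x[d] for x in elite]
            mean[d] = statistics.fmean(column)
            std[d] = statistics.pstdev(column)
        if max(std) < 1e-6:
            break
    return mean


def toy_noise(x):
    """Hypothetical two-block 'noise' with its minimum at (3, 1)."""
    return (x[0] - 3.0) ** 2 + (x[1] - 1.0) ** 2


best_budget = cross_entropy_minimize(toy_noise, [0.0, 0.0], [5.0, 5.0])
```

In the decap setting each objective call would be a full noise evaluation, so the number of samples per iteration dominates the runtime.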
Parallel Decap Budgeting
Evaluate the noise of each solution: this is the bottleneck of the computation, but it can be easily parallelized.
[Figure: samples drawn from g(x) in the plane of decap budget at block 1 versus decap budget at block 2, divided between Core 1 and Core 2]
Multi/Many-Core Based Parallelization
The graph shows the flow of multi-threaded SCE sample processing with multiple GPUs.
Generate n samples
→ Thread 1 ... Thread k
→ n/k samples processed on GPU 1 ... n/k samples processed on GPU k
→ Pick the top best ones
Z. Feng and P. Li, Multigrid on GPU: tackling power grid analysis on parallel SIMT platforms, ICCAD'08.
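The flow above (split n samples across k workers, evaluate, then keep the best) can be mimicked on a CPU with a thread pool. In the sketch below each thread stands in for a GPU worker and `evaluate_noise` is a placeholder for the power grid simulation. Both are assumptions for illustration, since the actual CUDA code is not shown in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(samples, k):
    """Split samples into k nearly equal contiguous chunks."""
    size, extra = divmod(len(samples), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        chunks.append(samples[start:end])
        start = end
    return chunks


def evaluate_parallel(samples, evaluate_noise, k):
    """Evaluate every sample, one chunk per worker, preserving sample order."""
    chunks = split_into_chunks(samples, k)
    with ThreadPoolExecutor(max_workers=k) as pool:
        results = list(pool.map(
            lambda chunk: [evaluate_noise(s) for s in chunk], chunks))
    return [noise for chunk_result in results for noise in chunk_result]


def pick_top(samples, noises, n_top):
    """Return the n_top samples with the smallest noise."""
    ranked = sorted(zip(noises, samples), key=lambda pair: pair[0])
    return [sample for _, sample in ranked[:n_top]]


# Hypothetical samples and a placeholder noise evaluation.
demo_samples = [[i, 10 - i] for i in range(10)]
demo_noises = evaluate_parallel(demo_samples, lambda s: abs(s[0] - 3), 3)
demo_best = pick_top(demo_samples, demo_noises, 1)
```

Because the samples are independent, the only synchronization point is the final ranking step.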
Complete Sensitivity-guided CE (SCE) Algorithm Flow
Power grid partition & sensitivity calculation
→ Build up a PDF for solutions
→ Generate block-level decap budgeting samples using the PDF
→ Determine decap size for each node based on relative sensitivity
→ Evaluate solutions on multi-core, multi-GPU
→ Result comparison
→ If not converged, repeat; otherwise stop
Experimental Setup
Hardware Platform Setup
– IBM Power Grid Benchmarks (S. Nassif, ASPDAC '08)
– C++ / GPU CUDA
– Intel Quad-Core CPU, 2.66 GHz
– Two NVIDIA GeForce GTX285 graphics cards
– Ubuntu 8.04, 64-bit
Compared to a recent conjugate gradient based decap optimization approach (iCG)
– H. Li, J. Fan, Z. Qi, S. Tan, L. Wu, Y. Cai and X. Hong, Partitioning-Based Approach to Fast On-Chip Decoupling Capacitor Budgeting and Minimization (TCAD '06).
Comparison - I
The figure shows total noise after decap insertion under different budgets and methods.
Noise-Decap Budget Tradeoff
Using our SCE method, a 70% decap budget can eliminate most of the power supply noise.
Comparison – II-1
The table compares runtime, total noise and number of iterations among different methods at a budget of 50%.
Columns: iCG, CE, and partition-based SCE with block dimensions 10x10 and 25x25. #vio.N: number of violating nodes; N(%): total noise; Iter.: iterations; T(s): runtime in seconds.
               iCG                 CE                  SCE 10x10           SCE 25x25
CKT   #vio.N   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)
ibm2  481      19.7  15     62     35.1  20     316    14.8  3      38     15.8  2      25
ibm4  1,829    24.2  15     638    --    --     --     19.2  4      401    20.3  3      300
ibm5  1,809    47.2  15     1265   --    --     --     38.1  4      1026   42.1  3      729
ibm6  1,926    30.1  15     1409   --    --     --     27.7  5      1258   28.1  3      771
Comparison – II-2
The table compares runtime, total noise and number of iterations among different methods at a budget of 70%.
Columns: iCG, CE, and partition-based SCE with block dimensions 10x10 and 25x25. #vio.N: number of violating nodes; N(%): total noise; Iter.: iterations; T(s): runtime in seconds.
               iCG                 CE                  SCE 10x10           SCE 25x25
CKT   #vio.N   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)
ibm2  481      1.83  13     55     12.4  20     312.1  1.77  3      38     1.78  2      25
ibm4  1,829    7.1   14     592    --    --     --     1.7   3      307    3.7   2      203
ibm5  1,809    38.3  15     1286   --    --     --     24.1  4      1028   23.8  3      735
ibm6  1,926    6.4   15     1430   --    --     --     5.1   5      1219   6.0   3      769
Comparison – II-3
The table compares runtime, total noise and number of iterations among different methods at a budget of 90%.
Columns: iCG, CE, and partition-based SCE with block dimensions 10x10 and 25x25. #vio.N: number of violating nodes; N(%): total noise; Iter.: iterations; T(s): runtime in seconds.
               iCG                 CE                  SCE 10x10           SCE 25x25
CKT   #vio.N   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)   N(%)  Iter.  T(s)
ibm2  481      0.02  16     65     0.02  19     294    0.004 5      63     0.01  3      37
ibm4  1,829    0.00  16     617    --    --     --     0.00  3      299    0.00  4      398
ibm5  1,809    31.2  17     1459   --    --     --     7.1   5      1251   8.4   3      1119
ibm6  1,926    0.00  1      151    --    --     --     0.00  1      354    0.00  1      356
Speedup Between Different Setups
The figure below compares the time cost of decap simulation with a single GPU and with two GPUs.
Conclusion
■ A novel cross entropy based optimization technique is proposed for the decoupling capacitor budgeting problem.
  ■ Sensitivity guided
  ■ Hierarchical optimization
  ■ Parallelization-friendly for multi-/many-core platforms
■ Experimental results demonstrate that our algorithm runs 2x faster than the prior approach and obtains 25% better results in the final decap budgeting solutions.
Thanks!