Clock Routing Based on X-Architecture Pattern Matching
Chia-Chun Tsai
Professor
Dept. of Computer Science and Information Engineering
Nanhua University
Dept. of Computer Science and Engineering
Yuan Ze University
Oct. 03, 2008
2
Outline
Introduction
Problem Formulation
Proposed Algorithm
Experimental Results
Conclusion
3
Introduction
An interesting geometric problem: the clock routing problem.
› How can a particular point (the clock source) be connected to a number of points (the clock sinks) such that all source-to-sink paths are equal to each other?
[Figure: a clock source connected to clock sinks; legend: Source, Sink]
4
MMM Approach [Jackson 90]
The MMM (Method of Means and Medians) algorithm builds the clock tree by recursively partitioning the set of sinks.
[Figure: recursive partitioning with Cut 1, Cut 2, and Cut 3]
5
GMA Approach [Kahng 91]
The GMA (Geometric Matching Algorithm) is based on a bottom-up matching approach.
[Figure: H-flip]
6
WCA Approach [Bo 91]
The WCA (Weighted Center based Algorithm) searches for the next tapping point using a new weighted center.
7
DME Approach [BK92, CHH92, Eda91]
DME (Deferred Merge Embedding): the bottom-up phase constructs a tree of merging segments, and the top-down embedding phase determines the exact locations.
[Figures: the bottom-up phase in DME; the top-down phase in DME]
8
GDME Approach [Wu 07]
The GDME (Grey relational analysis for DME), illustrated with 29 clock sinks.
› Partition S by alternating x- and y-medians, following the MMM approach, until the number of clock sinks in each partition zone, Z, is less than or equal to four.
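The alternating-median partitioning step can be sketched in a few lines of Python. This is only an illustration of the splitting rule stated on the slide, not the GDME implementation: the function name `partition`, the tuple representation of sinks, and the sample grid of 29 points are all assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def partition(sinks: List[Point], max_zone: int = 4, axis: int = 0) -> List[List[Point]]:
    """Split sinks at the median, alternating x (axis 0) and y (axis 1),
    until every zone holds at most max_zone sinks."""
    if len(sinks) <= max_zone:
        return [sinks]
    ordered = sorted(sinks, key=lambda p: p[axis])
    mid = len(ordered) // 2          # median split of the current zone
    next_axis = 1 - axis             # alternate the cut direction
    return (partition(ordered[:mid], max_zone, next_axis)
            + partition(ordered[mid:], max_zone, next_axis))

# Hypothetical input: 29 sinks laid out on a small grid.
sinks = [(float(i % 6), float(i // 6)) for i in range(29)]
zones = partition(sinks)
print(len(zones), [len(z) for z in zones])
```

Every returned zone contains at most four sinks, and the zones together cover the original sink set.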
9
GDME Approach (Cont’d)
› Apply Grey relational analysis together with the DME approach, then recursively split and construct a minimum-cost clock tree.
10
Clock Routing for 512 Sinks
11
Clock Tree Construction for Benchmark r5
12
Clock Network in a Chip
A typical SoC architecture contains a physical clock network.
Two factors characterize a clock network: clock delay and clock skew.
› The maximum clock delay limits the operating frequency.
› Clock skew (max clock delay - min clock delay) may cause chip functions to fail.
› Wanted: minimize the maximum clock delay and achieve exactly zero skew.
[Figure: clock network]
13
Wire Delay and Sink Loading
Two typical delay models for a wire, where r is the sheet resistance, ca is the unit area capacitance, cf is the unit fringing capacitance, and CL is the load capacitance of a clock sink:
› Elmore delay model (Elmore 48)
› FED, the fitted Elmore delay model (Abou-Seido 04)
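The slide's own delay expressions did not survive in this transcript, so the following is only a sketch of one common lumped form of the Elmore delay for a single wire driving a load. It assumes a wire resistance of r·l/w and a wire capacitance of (ca·w + cf)·l; the function name, the argument names, and that capacitance expression are assumptions, and the FED model with its fitted coefficients is not reproduced here.

```python
def elmore_wire_delay(r, ca, cf, length, width, c_load):
    """Elmore delay of one uniform wire driving a load (assumed lumped form).

    r      : sheet resistance
    ca, cf : unit area and unit fringing capacitance
    length, width : wire dimensions
    c_load : load capacitance CL at the far end
    """
    resistance = r * length / width             # assumed: R = r * l / w
    wire_cap = (ca * width + cf) * length       # assumed: C = (ca * w + cf) * l
    return resistance * (wire_cap / 2.0 + c_load)

# With no load, doubling the length quadruples the delay (quadratic growth).
short = elmore_wire_delay(1.0, 1.0, 0.0, 2.0, 1.0, 0.0)
long_ = elmore_wire_delay(1.0, 1.0, 0.0, 4.0, 1.0, 0.0)
print(short, long_)
```

The quadratic growth with wire length is the reason interconnect delay comes to dominate for long clock wires.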
14
Interconnection Delay
Interconnects dominate signal delay
Data from ITRS Roadmap
15
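Clock Tree Topology
[Figure: a three-level clock tree (level 1 to level 3) from the clock source through Steiner points to eight sinks; legend: Source, Steiner point, Sink]
Wire delays per level: 20, 25; 7, 8, 7, 5; 4, 4, 4, 4, 2, 2, 3, 3
Sink delays: 31, 31, 32, 32, 34, 34, 33, 33
Delay = max. delay = 34
Skew = max. delay - min. delay = 34 - 31 = 3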
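The delay and skew figures of this example can be reproduced by summing the wire delays along each source-to-sink path. The sketch below assumes a balanced binary tree in which each root branch feeds two Steiner points and each Steiner point feeds two sinks, which is the structure the numbers are consistent with; the variable names are illustrative.

```python
# Wire delays read from the figure, grouped by tree level (assumed grouping).
root_edges = [20, 25]                      # clock source to first Steiner points
mid_edges = [7, 8, 7, 5]                   # first to second Steiner points
leaf_edges = [4, 4, 4, 4, 2, 2, 3, 3]      # second Steiner points to sinks

# Path delay of sink i = sum of the three edges on its path to the source.
sink_delays = [root_edges[i // 4] + mid_edges[i // 2] + leaf_edges[i]
               for i in range(len(leaf_edges))]

delay = max(sink_delays)                   # Delay = max. delay
skew = delay - min(sink_delays)            # Skew = max. delay - min. delay
print(sink_delays, delay, skew)
```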
16
Manhattan vs. X-architecture Clock Routings
Manhattan routing (horizontal and vertical)
› Leads to
- Longer wire length on average
- Worse performance, because interconnect delay dominates
X-architecture routing
› Reduces wire length
› Proviso: the manufacturing technology must support diagonal routing directions.
› TSMC and UMC are ready for 65-nm X-Architecture designs (EE Times, May 25, 2006). http://www.eetimes.com/news/design/showArticle.jhtml?articleID=188500129
[Figure: partial routing result, Primary 1 @ 0.13 µm]
17
Layer definition in Manhattan and X Architectures
18
Compared Manhattan and X-Architectures
Manhattan vs. X-architecture
› Same area, higher performance
› Same performance, less area
19
Manhattan vs. X Architectures
X-architecture (horizontal, vertical, and diagonal)
› Arbitrary angle: L = [(x1 - x2)^2 + (y1 - y2)^2]^(1/2)
› Manhattan arch.: LM = L(sin α + cos α)
› X-arch.: LX = L(0.41 sin α + cos α)
› Benefits [Teig IWSLIP2002]:
» 20% reduction in wire length
» 20% saving in power
» 10% improvement in chip performance
» 30% reduction in die cost
[Figures: the route between (x1, y1) and (x2, y2) at angle α for arbitrary-angle (L), Manhattan (LM), and X-architecture (LX, 45°) routing; layer assignment of Metal 1 to Metal 4 between s1 and s2, with label PB; partial routing result, Primary 1 @ 0.13 µm]
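The three length formulas can be checked numerically. The sketch below computes them directly from the coordinate differences, assuming α is measured from the longer axis (so 0° ≤ α ≤ 45°) and using the exact factor √2 - 1 ≈ 0.41 for the diagonal term; the function name `route_lengths` is illustrative.

```python
import math

def route_lengths(p1, p2):
    """Return (L, LM, LX): Euclidean, Manhattan, and X-architecture lengths."""
    dx = abs(p1[0] - p2[0])
    dy = abs(p1[1] - p2[1])
    euclid = math.hypot(dx, dy)
    manhattan = dx + dy
    # Diagonal covers min(dx, dy) in both axes; the rest is a straight segment.
    x_arch = max(dx, dy) + (math.sqrt(2) - 1) * min(dx, dy)
    return euclid, manhattan, x_arch

L, LM, LX = route_lengths((0.0, 0.0), (4.0, 3.0))
alpha = math.atan2(3.0, 4.0)   # angle to the longer axis
print(L, LM, LX)
print(L * (math.sin(alpha) + math.cos(alpha)))                       # equals LM
print(L * ((math.sqrt(2) - 1) * math.sin(alpha) + math.cos(alpha)))  # equals LX
```

For this pair of points the X-architecture route is about 25% shorter than the Manhattan route; the saving depends on α and vanishes for purely horizontal or vertical connections.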
20
Our Contribution
Construct a ZST (Zero Skew Tree) based on X-architecture and 16 predefined matching patterns
Simplify DME merging procedures
X-flip shortens wire length
Wire sizing reduces routing resources
[Figure: routing result, r1 @ 0.13 µm]
21
Outline
Introduction
Problem Formulation
Proposed Algorithm
Experimental Results
Conclusion
22
Problem Formulation
A general CRP (clock routing problem):
Given: a set of n clock sinks, S = {s1, s2, …, sn}
Objective: construct a ZST (Zero Skew clock Tree) based on X-architecture with better performance.
23
DME-4 [Shen ISCAS06]
Associated with DME (Deferred Merge Embedding) [Chao TCAD92]
Constructs a TOR (Tiled Octangular Region) in the bottom-up phase of DME.
Resolves the exact coordinates in the top-down phase of DME.
Uses balanced bipartition to reduce wire length.
Delay model: FED (Fitted Elmore Delay) [Abou-Seido TVLSI04]
[Figure: the TORs of s1 and s2, with radius1 and radius2, and the merging segment]
The construction procedure should be simpler!
24
NVM [Wang VLSI-DAT07]
Also uses DME to construct a ZST (Zero Skew Tree).
Focuses on NVM (Node Via Minimization). Reducing the number of vias is crucial.
Delay model: Elmore model
[Figure: Metal 1 to Metal 4, with node vias and edge vias]
They use various layer definitions, which is not practical enough.
25
Definition of Our Clock Problem
Given: a set of clock sinks, S = {s1, s2, …, sn}, and an X-pattern library.
Objective: construct a ZST based on X-architecture with better performance.
Preliminary
› Layer definition
› One-bend X-pattern
› 16 X-patterns as a library
[Figure: s1 and s2 connected by PTN_1 and by PTN_2]
X-pattern library (rows: zone location of start point; columns: zone location of end point)
   | SLT   | SRT   | SLB   | SRB
LT | PTN_R | PTN_1 | PTN_2 | PTN_R
RT | PTN_1 | PTN_R | PTN_R | PTN_2
LB | PTN_2 | PTN_R | PTN_R | PTN_1
RB | PTN_R | PTN_2 | PTN_1 | PTN_R
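The 16-entry library is small enough to hold as a nested lookup table. The sketch below transcribes the table, keyed first by the zone of the start point and then by the sub-zone of the end point; the names `X_PATTERN_TABLE` and `lookup_pattern` are illustrative and not taken from the slides.

```python
# Rows: zone of the start point. Columns: sub-zone of the end point.
X_PATTERN_TABLE = {
    "LT": {"SLT": "PTN_R", "SRT": "PTN_1", "SLB": "PTN_2", "SRB": "PTN_R"},
    "RT": {"SLT": "PTN_1", "SRT": "PTN_R", "SLB": "PTN_R", "SRB": "PTN_2"},
    "LB": {"SLT": "PTN_2", "SRT": "PTN_R", "SLB": "PTN_R", "SRB": "PTN_1"},
    "RB": {"SLT": "PTN_R", "SRT": "PTN_2", "SLB": "PTN_1", "SRB": "PTN_R"},
}

def lookup_pattern(start_zone: str, end_subzone: str) -> str:
    """Return the X-pattern for a start zone and an end sub-zone."""
    return X_PATTERN_TABLE[start_zone][end_subzone]

print(lookup_pattern("LT", "SRT"))
```

A start point in LT with its end point in SRT maps to PTN_1, matching the s1, s2 example used on the X-Pattern slides.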
26
X-Pattern
Main idea:
› The clock source is located near the center of the routing area.
› Centralize all the routing wires.
[Figure: complete routing result, r1 @ 0.13 µm]
27
X-Pattern (cont’d)
Assume that s1 and s2 are paired.
› Step 1. Tile the routing area: s1 is located in LT.
› Step 2. Tile the routing area of s1: s2 is located in SRT.
› Step 3. Define the X-patterns for the 4 sub-zones.
[Figure: sinks s1 to s8 in the routing area tiled into LT, RT, LB, and RB; the area of s1 tiled into SLT, SRT, SLB, and SRB; PTN_1 and PTN_2 routes between s1 and s2 for the sub-zones]
28
X-Pattern (cont’d)
[Figure: the tiling example for s1 and s2, shown next to the resulting X-pattern table]
Rows: zone location of start point; columns: zone location of end point
   | SLT   | SRT   | SLB   | SRB
LT | PTN_R | PTN_1 | PTN_2 | PTN_R
RT | PTN_1 | PTN_R | PTN_R | PTN_2
LB | PTN_2 | PTN_R | PTN_R | PTN_1
RB | PTN_R | PTN_2 | PTN_1 | PTN_R
29
Outline
Introduction
Problem Formulation
Proposed Algorithm
Experimental Results
Conclusion
30
Proposed Algorithm
PMXF (Pattern-Matching based on X-clock routing with X-Flip) algorithm
Algorithm PMXF
Input: a set S of sinks and the 16 kinds of X-patterns for a pair of points
Output: a ZST based on X-architecture with zero skew and minimal delay
begin
  While (|S| > 1)
  {
    (s1, s2) = DPPG(S);                        // Determine a pair of points using GMA
    Pattern = CPXP(s1, s2) ∩ CPXP(s2, s1);     // Choose proper X-pattern
    Pt = DCTP(s1, s2, x);                      // Find tapping point Pt of s1 and s2
    If (x < 0) WireSizing(s1, s2);             // Adjust w2
    If (x > 1) WireSizing(s2, s1);             // Adjust w1
    DME-X(s1, s2, Pt, Pattern);                // Construct the clock tree
    X-Flip(s1, s2);                            // Reduce wire length
    Insert(S, Pt);                             // Insert Pt to S
  }
End
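The control flow of the loop above can be sketched in Python. This is a structural sketch only, under loud simplifications: pairing picks the closest pair under an assumed X-architecture distance instead of the GMA matching, the tapping point is a plain midpoint instead of the zero-skew point, and pattern selection, wire sizing, DME-X, and X-Flip are left out. All function names are hypothetical.

```python
import math
from itertools import combinations

def x_distance(p, q):
    """Assumed X-architecture distance: one straight run plus one diagonal run."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    return max(dx, dy) + (math.sqrt(2) - 1) * min(dx, dy)

def closest_pair(points):
    """Stand-in for DPPG: pick the closest pair of active points."""
    return min(combinations(points, 2), key=lambda pq: x_distance(*pq))

def build_tree(sinks):
    """Bottom-up merging until one root remains; returns (root, merges)."""
    active = list(sinks)
    merges = []
    while len(active) > 1:
        s1, s2 = closest_pair(active)
        # Stand-in for DCTP: midpoint, NOT a zero-skew tapping point.
        tap = ((s1[0] + s2[0]) / 2, (s1[1] + s2[1]) / 2)
        merges.append((s1, s2, tap))
        active.remove(s1)
        active.remove(s2)
        active.append(tap)            # Insert(S, Pt)
    return active[0], merges

root, merges = build_tree([(0, 0), (2, 0), (10, 0), (12, 0)])
print(root, len(merges))
```

Each pass removes two points and inserts one, so n sinks always produce n - 1 merges.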
31
DPPG Procedure
Determine Pair of Points in GMA
› GMA is a bottom-up algorithm [Kahng DAC91]
› Focuses on path-length balancing
[Figure: points X1 to X15, with DPPG applied at each pairing]
Time complexity: O(log n)
32
CPXP Procedure
Choose Proper X-Pattern
› Ex. CPXP(X1, X2)
› Step 1. Tile the routing area: X1 is located in LT.
› Step 2. Tile the routing area of the start point, X1: X2 is located in SRT.
› Step 3. Map to the given X-pattern table.
CPXP(X1, X2) = PTN_1
CPXP(X2, X1) = PTN_R
CPXP(X1, X2) ∩ CPXP(X2, X1) = PTN_1
[Figure: points X1 to X15, with the routing area tiled into LT, RT, LB, and RB and the area of the start point tiled into SLT, SRT, SLB, and SRB; CPXP applied at each pairing]
Time complexity: O(log n)
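The zone classification and the intersection in this example can be sketched as follows. Two assumptions are made that the slides do not state explicitly: sub-zones are taken relative to the start point with the y axis pointing up, and PTN_R is treated as compatible with either pattern, which is consistent with PTN_1 ∩ PTN_R = PTN_1 in the example. The function names and sample coordinates are hypothetical.

```python
def zone_of(point, center):
    """Classify point as LT, RT, LB, or RB relative to center (y axis up).
    Points on a boundary are assigned to the right/top side (assumption)."""
    horizontal = "L" if point[0] < center[0] else "R"
    vertical = "B" if point[1] < center[1] else "T"
    return horizontal + vertical

def subzone_of(end, start):
    """Sub-zone of the end point, assumed to be relative to the start point."""
    return "S" + zone_of(end, start)

def intersect_patterns(a, b):
    """Assumed rule: PTN_R matches anything; equal patterns match themselves."""
    if a == "PTN_R":
        return b
    if b == "PTN_R":
        return a
    return a if a == b else None

center = (5.0, 5.0)                 # hypothetical center of the routing area
x1, x2 = (1.0, 9.0), (3.0, 10.0)    # hypothetical coordinates
print(zone_of(x1, center), subzone_of(x2, x1))
print(intersect_patterns("PTN_1", "PTN_R"))
```

With these coordinates X1 falls in LT and X2 in SRT, the case worked through on the slide.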
33
DCTP Procedure
Determine Coordinate of Tapping Point
› The tapping point, Pt, is determined to achieve zero skew. [Tsay ICCAD91]
› Zero-skew condition ratio: x.
› If 0 ≤ x ≤ 1, the tapping point is located on the wire.
› If x < 0 or x > 1, wire snaking is needed.
› Binary search is used to determine the coordinate. [Wu IEICE07]
Time complexity: O(n)
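For the plain Elmore model the zero-skew ratio has a closed form, obtained by equating the delays of the two branches that meet at the tapping point. The sketch below uses that closed form as an illustration; it assumes per-unit-length wire resistance `alpha` and capacitance `beta`, subtree delays `t1`, `t2`, and subtree capacitances `c1`, `c2`, all of which are names introduced here. It does not cover the FED model, for which the slide uses binary search.

```python
import math

def tapping_ratio(t1, t2, c1, c2, length, alpha, beta):
    """Zero-skew ratio x under the Elmore model: the tapping point sits at
    distance x * length from subtree 1 along the connecting wire."""
    numerator = t2 - t1 + alpha * length * (c2 + beta * length / 2.0)
    denominator = alpha * length * (beta * length + c1 + c2)
    return numerator / denominator

def branch_delay(t, c, wire_len, alpha, beta):
    """Elmore delay through a wire of wire_len into a subtree (delay t, cap c)."""
    return t + alpha * wire_len * (beta * wire_len / 2.0 + c)

# Hypothetical values for two subtrees joined by a wire of length 10.
x = tapping_ratio(1.0, 3.0, 2.0, 1.0, 10.0, 0.1, 0.2)
d1 = branch_delay(1.0, 2.0, x * 10.0, 0.1, 0.2)
d2 = branch_delay(3.0, 1.0, (1.0 - x) * 10.0, 0.1, 0.2)
print(x, d1, d2, math.isclose(d1, d2))
```

When the ratio falls outside [0, 1] no point on the wire balances the two branches, which is the case the slide resolves with wire snaking or, in PMXF, with wire sizing.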
34
Wire Sizing
Wire snaking is one of the common methods for constructing a ZST.
Benefits of adopting wire sizing [El-Moursy GLSVLSI03]
› Releases routing resources
› But needs extra power due to wider wires
[Figure: snaking wire vs. sized wire]
35
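Wire Sizing (cont’d)
Consider the zero-skew condition, x < 0.
[Equations not recoverable from this transcript: d(s1, Pt) is written with the FED model for the wire of width w1 and length l1, set against d(s2, Pt), and followed by an expression in w, l1, l2, the load capacitances CL, ca, cf, and the FED coefficients D, E, and F.]
Time complexity: O(n)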
36
DME-X Procedure
Traditional DME based on X-arch.
Bottom-up phase
› Create TOR.
› Merge.
[Figure: points X1 to X8]
37
DME-X (cont’d)
Traditional DME based on X-arch.
Bottom-up phase
› Create TOR.
› Merge.
Top-down phase
› Determine the points’ locations.
› Connect all the nodes.
[Figure: points X1 to X15]
38
DME-X (cont’d)
Our DME-X method
› Integrate the bottom-up and top-down phases
› Construct the parallelogram
› DCTP(X4, X6)
› CPXP(X4, X6) ∩ CPXP(X6, X4)
› Tip: run CPXP first and then DCTP to save running time.
[Figure: points X1 to X15 and X9’, with DPPG]
Time complexity: O(n)
39
X-Flip Procedure
Exchange the X-pattern based on the predefined patterns.
[Figure: s1 and s2 connected by PTN_1 and PTN_2, before and after the exchange]
Complete routing result: 08-5 @ 0.13 µm
Delay = 4454.614 ps
Cost = 38219.374 µm
Power = 0.000531 W
40
X-Flip (cont’d)
Check the length of the (i-1)th level when constructing the ith level.
Complete routing result: 08-5 @ 0.13 µm with X-Flip
Delay = 4139.209 ps, saving 7%
Cost = 36334.753 µm, saving 4.9%
Power = 0.000515 W, saving 3%
Time complexity: O(n)
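The quoted savings follow from the figures reported for 08-5 without X-Flip (delay 4454.614 ps, cost 38219.374, power 0.000531) and with X-Flip. A short check, with the helper name `saving_percent` introduced only for this example:

```python
def saving_percent(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0

# (without X-Flip, with X-Flip) for benchmark 08-5, as reported on the slides.
results = {
    "delay": (4454.614, 4139.209),
    "cost": (38219.374, 36334.753),
    "power": (0.000531, 0.000515),
}

savings = {name: saving_percent(b, a) for name, (b, a) in results.items()}
for name, value in savings.items():
    print(f"{name}: {value:.2f}%")
```

The computed values round to the 7%, 4.9%, and 3% stated on the slide.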
41
Time Complexity Analysis
Algorithm PMXF
Input: a set S of sinks and the 16 kinds of X-patterns for a pair of points
Output: a ZST based on X-architecture with zero skew and minimal delay
begin
  While (|S| > 1)
  {
    (s1, s2) = DPPG(S);                        // Determine a pair of points using GMA
    Pattern = CPXP(s1, s2) ∩ CPXP(s2, s1);     // Choose proper X-pattern
    Pt = DCTP(s1, s2, x);                      // Find tapping point Pt of s1 and s2
    If (x < 0) WireSizing(s1, s2);             // Adjust w2
    If (x > 1) WireSizing(s2, s1);             // Adjust w1
    DME-X(s1, s2, Pt, Pattern);                // Construct the clock tree
    X-Flip(s1, s2);                            // Reduce wire length
    Insert(S, Pt);                             // Insert Pt to S
  }
End
Time complexity labels shown on the slide: O(log n), O(n log n), O(n)
42
Outline
Introduction
Problem Formulation
Proposed Algorithm
Experimental Results
Conclusion
43
Experimental Results
Platform: WinXP-SP2 on P4-M 1.7G with 1G memory
Compiler: Borland C++ Builder 6.0
IBM benchmarks r1-r5 are used for testing our PMXF algorithm.
Our PMXF is compared with
› DME-4 [Shen ISCAS06], based on the fitted Elmore delay model
› NVM [Wang VLSI-DAT07], based on the Elmore delay model
0.13 µm fabrication parameters are used.
DME-4 (fitted Elmore):
r = 0.623 Ω/µm | D = 1.12673 ln 2 | Fclk = 100 MHz
ca = 0.00598 fF/µm | E = 1.10463 ln 2 | Vdd = 1.2 V
cf = 0.043 fF/µm | F = 1.04836 ln 2
NVM (Elmore):
r = 0.623 Ω/µm
c = 0.118 fF/µm
44
Our GUI
45
Our GUI (cont’d)
46
Our Results Based on FED Model
Comparison of our PMXF algorithm without X-Flip (PMX) and with X-Flip (PMXF) in terms of delay, wire length, power consumption, total via, and runtime for the FED model. Each metric lists PMX, PMXF, PMXF/PMX.
Benchmark | #Sinks | delay (µs) | wire length (µm) | power (W) | total via | runtime (s)
r1 | 267 | 0.47415, 0.310858, 0.656 | 1406401, 1383347, 0.983 | 0.074957, 0.076774, 1.024 | 1248, 1215, 0.973 | 6.72, 6.859, 1.02
r2 | 598 | 1.130498, 0.841717, 0.744 | 3000575, 2863408, 0.954 | 0.194351, 0.19197, 0.987 | 2744, 2816, 1.026 | 28.12, 31.535, 1.121
r3 | 862 | 1.632144, 1.790971, 1.097 | 3750372, 3651790, 0.973 | 0.263825, 0.25889, 0.981 | 3993, 4019, 1.006 | 65.163, 70.261, 1.078
r4 | 1903 | 4.639215, 3.989911, 0.860 | 7593864, 7221328, 0.950 | 0.617959, 0.599185, 0.969 | 9170, 9160, 0.998 | 754.024, 897.16, 1.189
r5 | 3101 | 8.987384, 7.881827, 0.877 | 11322668, 10855445, 0.958 | 1.023591, 0.998684, 0.975 | 14624, 14528, 0.993 | 1840.375, 2309.672, 1.255
Average | - | 0.847 | 0.964 | 0.987 | 0.999 | 1.126
Improves delay by 15.3%
Improves wire length by 3.6% and power by 1.3%
Improves total via by 0.1%, but needs 12.6% more runtime
47
Our Results Based on ED Model
Comparison of our PMXF algorithm without X-Flip (PMX) and with X-Flip (PMXF) in terms of delay, wire length, power consumption, total via, and runtime for the ED model. Each metric lists PMX, PMXF, PMXF/PMX.
Benchmark | #Sinks | delay (µs) | wire length (µm) | power (W) | total via | runtime (s)
r1 | 267 | 0.165138, 0.137993, 0.836 | 1517071, 1364700, 0.900 | 0.165665, 0.163641, 0.988 | 1230, 1229, 0.999 | 6.930, 7.491, 1.081
r2 | 598 | 0.455854, 0.320785, 0.704 | 2917878, 2788433, 0.956 | 0.377402, 0.374030, 0.991 | 2846, 2867, 1.007 | 25.577, 30.144, 1.179
r3 | 862 | 0.526273, 0.498202, 0.947 | 3757000, 3696636, 0.984 | 0.525380, 0.514669, 0.980 | 4016, 3993, 0.994 | 60.213, 66.957, 1.112
r4 | 1903 | 1.822653, 1.614070, 0.886 | 7513128, 7363705, 0.980 | 1.208581, 1.185089, 0.981 | 9220, 8912, 0.967 | 732.046, 790.046, 1.079
r5 | 3101 | 2.577663, 2.095517, 0.813 | 11246479, 10854213, 0.965 | 1.978120, 1.952519, 0.987 | 14718, 14546, 0.988 | 1597.062, 1689.281, 1.058
Average | - | 0.837 | 0.957 | 0.985 | 0.991 | 1.102
Improves delay by 16.3%
Improves wire length by 4.3% and power by 1.5%
Improves total via by 0.9%, but needs 10.2% more runtime
48
Clock Tree Construction of r5 Based on PMXF
#sinks: 3101
Delay: 7.881827 µs
Skew: 0
#vias: 14528
Power: 0.998684 W
Runtime: 2309.672s
49
Our Results Compared with DME-4
[8] W. Shen, Y. Cai, J. Hu, X. Hong, and B. Lu, “High Performance Clock Routing in X-architecture,” IEEE International Symposium On Circuits and Systems, 2006, pp. 2081-2084.
Compare our PMXF algorithm with DME-4[8] in terms of delay, wire length, and power consumption for FED model.
50
Our Results Compared with DME-4
The comparison of our algorithm and DME-4 [8] in delay
Benchmarks | #sinks | Delay (µs): DME-4 [8] | PMXF | PMXF/DME-4
r1 | 267 | 0.471340 | 0.310858 | 0.659
r2 | 598 | 1.145970 | 0.841717 | 0.734
r3 | 862 | 1.664930 | 1.790971 | 1.075
r4 | 1903 | 4.631840 | 3.989911 | 0.861
r5 | 3101 | 9.053950 | 7.881827 | 0.871
Average | - | - | - | 0.840
Improves delay by 16%
51
Our Results Compared with DME-4
The comparison of our algorithm and DME-4 [8] in wire length
Benchmarks | #sinks | Wire length (µm): DME-4 [8] | PMXF | PMXF/DME-4
r1 | 267 | 1414960 | 1383347 | 0.977
r2 | 598 | 2863420 | 2863408 | 0.999
r3 | 862 | 3656580 | 3651790 | 0.998
r4 | 1903 | 7245500 | 7221328 | 0.996
r5 | 3101 | 10971100 | 10855445 | 0.989
Average | - | - | - | 0.992
Improves wire length by 0.8%
52
Our Results Compared with DME-4
The comparison of our algorithm and DME-4 [8] in power
Benchmarks | #sinks | Power (W): DME-4 [8] | PMXF | PMXF/DME-4
r1 | 267 | 0.074594 | 0.076785 | 1.029
r2 | 598 | 0.180590 | 0.174153 | 0.964
r3 | 862 | 0.254845 | 0.258602 | 1.015
r4 | 1903 | 0.589042 | 0.583533 | 0.991
r5 | 3101 | 0.981078 | 0.909697 | 0.927
Average | - | - | - | 0.985
Improves power by 1.5%
53
Our Results Compared with NVM
[9] C. H. Wang and W. K. Mak, “λ-Geometry clock tree construction with wire length and via minimization,” IEEE International Symposium on VLSI-DAT, 2007, pp. 124-127.
Compare our PMXF algorithm with NVM[9] in terms of via and wire length for ED model.
54
Our Results Compared with NVM
The comparison of our algorithm and NVM [9] in node via and total via
Benchmarks | #sinks | Node via: NVM [9] | PMXF | Total via: NVM [9] | PMXF | PMXF/NVM
r1 | 267 | 832 | 720 | 1486 | 1229 | 0.827
r2 | 598 | 1859 | 1658 | 3401 | 2867 | 0.843
r3 | 862 | 2689 | 2335 | 4921 | 3993 | 0.811
r4 | 1903 | 6046 | 5200 | 10769 | 8912 | 0.828
r5 | 3101 | 9818 | 8504 | 17681 | 14546 | 0.823
Average | - | - | - | - | - | 0.826
Improves total via by 17.4%
55
Our Results Compared with NVM
The comparison of our algorithm and NVM [9] in wire length
Benchmarks | #sinks | Wire length (µm): NVM [9] | PMXF | PMXF/NVM
r1 | 267 | 1200300 | 1364700 | 1.137
r2 | 598 | 2354000 | 2788433 | 1.185
r3 | 862 | 3074900 | 3696636 | 1.202
r4 | 1903 | 6145000 | 7363705 | 1.198
r5 | 3101 | 9152300 | 10854213 | 1.186
Average | - | - | - | 1.181
Worsens wire length by 18.1%
56
Explanation for Wire Length
Why did we perform worse than NVM in wire length?
› The wire length is determined by the different topologies.
Merging rules used in NVM
› N = |S|/c, where c is a constant.
› Step 1. Sort the corresponding edges, eim, of each sink si, where si ∈ S and eim ∈ E.
› Step 2. Take the first N entries of E.
› Step 3. Merge the N elements in S.
› Step 4. Remove the merged elements from S and add the newly merged ones.
57
Conclusion
X-architecture has been proven more effective than Manhattan architecture in constructing a ZST.
We defined 16 X-routing patterns to simplify the merging procedures in constructing an X-based ZST.
X-flip can shorten wire length and reduce clock delay.
Wire sizing removes snaking wires and saves routing resources.
Our algorithm performs well in clock delay, wire length, power, and via cost.
58
Future Works
Insert buffers to get higher performance
Consider the inductance effects in the delay model
Consider DFM problems
› Antenna effect
› Optical correction
› Redundant via insertion
› CMP variation
Planar X-routing with fewer metal layers
59
Thank you for your attention!