
1

Clock Routing Based on X-Architecture Pattern Matching

Chia-Chun Tsai, Professor

Dept. of Computer Science and Information Engineering

Nanhua University

Dept. of Computer Science and Engineering, Yuan Ze University

Oct. 03, 2008

2

Outline

› Introduction
› Problem Formulation
› Proposed Algorithm
› Experimental Results
› Conclusion

3

Introduction

An interesting geometric problem (the clock routing problem):

› How to connect a particular point (the clock source) to a number of points (the clock sinks) such that all paths from the source to the sinks have equal length.

Source

Sink

4

The MMM (Method of Means and Medians) algorithm builds the clock tree by recursive partitioning (Cut 1, Cut 2, and Cut 3 in the figure).

MMM Approach [Jackson 90]

5

The GMA (Geometric Matching Algorithm) is based on a bottom-up matching approach.

[Figure label: H-flip]

GMA Approach [Kahng 91]

6

The WCA (Weighted Center based Algorithm) searches for the next tapping point using a new weighted center.

WCA Approach [Bo 91]

7

The DME (Deferred Merge Embedding) algorithm: the bottom-up phase constructs a tree of merging segments, and the top-down embedding phase determines the exact location of each node.

DME Approach [BK92, CHH92, Eda91]

[Figures: the bottom-up phase in DME; the top-down phase in DME]

8

The GDME (Grey relational analysis for DME) approach, illustrated with 29 clock sinks.

› Partition S by alternating x- and y-medians, following the MMM approach, until the number of clock sinks in each partition zone, Z, is less than or equal to four.

GDME Approach [Wu 07]

9

› Apply Grey relational analysis in association with the DME approach, then recursively split and construct a minimum-cost clock tree.

GDME Approach (Cont’d)

10

Clock Routing for 512 Sinks

11

Clock Tree Construction for Benchmark r5

12

Clock Network in a Chip

Two factors matter for a clock network: clock delay and clock skew.
› Max clock delay dominates the operating frequency.
› Clock skew (max clock delay – min clock delay) may cause chip functions to fail.
› Wanted: minimize the max clock delay and get exact-zero skew.

[Figure: clock network]

A typical SoC architecture contains a physical clock network.

13

Wire Delay and Sink Loading

Two typical delay models for a wire. r is the sheet resistance, ca is the unit area capacitance, cf is the unit fringing capacitance, and CL is the load capacitance of a clock sink.

› Elmore delay model (Elmore 48)
› The FED (Fitted Elmore Delay) model (Abou-Seido 04)

14

Interconnection Delay

Interconnects dominate signal delay

Data from ITRS Roadmap

15

Clock Tree Topology

[Figure: a three-level clock tree (level 1 to level 3) from the clock source through Steiner points to eight sinks, with edge delays labeled. The resulting sink delays are 31, 31, 32, 32, 34, 34, 33, 33.]

› Delay = max. delay = 34
› Skew = max. delay - min. delay = 34 - 31 = 3
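As a small illustration of the two definitions on this slide, the sketch below recomputes delay and skew from the eight sink delays shown in the example tree. The function names are hypothetical and chosen only for this example.

```python
# Sink delays read from the example clock tree on this slide.
sink_delays = [31, 31, 32, 32, 34, 34, 33, 33]


def clock_delay(delays):
    """Clock delay is the largest source-to-sink delay."""
    return max(delays)


def clock_skew(delays):
    """Clock skew is the gap between the slowest and the fastest sink."""
    return max(delays) - min(delays)


print("delay =", clock_delay(sink_delays))
print("skew  =", clock_skew(sink_delays))
```

A zero-skew tree is one for which `clock_skew` returns 0, which is the target of the algorithm presented later.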

16

Manhattan routing (horizontal and vertical)
› Leads to
  - Longer wire length on average
  - Worse performance, dominated by interconnect delay

X-architecture routing
› Reduces wire length.
› Proviso: the manufacturing technology must support the diagonal routing direction.
› TSMC and UMC are ready for 65-nm X-Architecture designs (EE Times, May 25, 2006). http://www.eetimes.com/news/design/showArticle.jhtml?articleID=188500129

[Figure: partial routing result, Primary 1 @ 0.13 μm]

Manhattan vs. X-architecture Clock Routings

17

Layer definition in Manhattan and X Architectures

18

Comparing Manhattan and X-Architectures

Manhattan vs. X-architecture:
› Same area, higher performance
› Same performance, less area

19

X-architecture (horizontal, vertical, and diagonal)

› L = [(x1 - x2)^2 + (y1 - y2)^2]^(1/2)
› LM = L(sin α + cos α)
› LX = L(0.41 sin α + cos α)

› Benefits [Teig IWSLIP2002]:
  » 20% reduction in wire length
  » 20% saving in power
  » 10% improvement in chip performance
  » 30% reduction in die cost
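The three length formulas on this slide (L, LM, LX) can be checked numerically. The sketch below computes them directly from coordinates, assuming α is the angle of the connection measured from the longer axis, so that 0 ≤ α ≤ 45° and 0.41 stands for √2 − 1. The function name is hypothetical.

```python
import math


def wire_lengths(p1, p2):
    """Return (euclidean, manhattan, x_arch) wire lengths between two points."""
    dx = abs(p1[0] - p2[0])
    dy = abs(p1[1] - p2[1])
    euclidean = math.hypot(dx, dy)            # L
    manhattan = dx + dy                       # LM = L(sin a + cos a)
    long_side, short_side = max(dx, dy), min(dx, dy)
    # Route the short side diagonally (45 degrees), the remainder straight.
    x_arch = (long_side - short_side) + math.sqrt(2) * short_side
    return euclidean, manhattan, x_arch


L, LM, LX = wire_lengths((0, 0), (4, 3))
alpha = math.atan2(3, 4)
print(L, LM, LX)
# Slide formula with the rounded 0.41 coefficient, for comparison.
print(L * (0.41 * math.sin(alpha) + math.cos(alpha)))
```

For any pair of points the X-architecture length lies between the Euclidean and the Manhattan length, which is the source of the wire-length saving.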

[Figure: partial routing result, Primary 1 @ 0.13 μm. Three panels connect (x1, y1) to (x2, y2) at angle α: arbitrary angle (length L), Manhattan architecture (length LM), and X-architecture (length LX, using a 45° segment). Further panels show sinks s1 and s2 routed on layers Metal 1 to Metal 4.]

Manhattan vs. X Architectures

20

Routing result: r1 @ 0.13 μm

Our Contribution

Construct a ZST (Zero Skew Tree) based on X-architecture and 16 predefined matching patterns:
› Simplify DME merging procedures
› X-flip shortens wire length
› Wire sizing reduces routing resources

21

Outline


22

Problem Formulation

A general CRP (clock routing problem):

Given: a set of n clock sinks, S = {s1, s2, …, sn}

Objective: construct a ZST (Zero Skew clock Tree) based on X-architecture with better performance.

23

DME-4 [Shen ISCAS06]

Associated with DME (Deferred Merge Embedding) [Chao TCAD92].
› Construct the TOR (Tiled Octangular Region) in the bottom-up phase of DME.
› Resolve the exact coordinates in the top-down phase of DME.
› Use balanced bipartition to reduce wire length.
› Delay model: FED (Fitted Elmore Delay) [Abou-Seido TVLSI04].

[Figure: the TORs of sinks s1 and s2, with radius1 and radius2, and the resulting merging segment.]

The construction procedure should be easier!

24

[Figure: routing layers Metal 1 to Metal 4 with node vias and edge vias.]

NVM [Wang VLSI-DAT07]

› Also uses DME to construct a ZST (Zero Skew Tree).
› Focuses on NVM (Node Via Minimization); reducing the number of vias is crucial.
› Delay model: Elmore model.

They use various layer definitions.

Not practical enough.

25

Definition of Our Clock Problem

Given: a set of clock sinks, S = {s1, s2, …, sn}, and an X-pattern library.

Objective: construct a ZST based on X-architecture with better performance.

Preliminary
› Layer definition
› One-bend X-pattern
› 16 X-patterns as a library

[Figure: the one-bend patterns PTN_1 and PTN_2 between s1 and s2.]

X-pattern lookup table (rows: zone location of start point; columns: zone location of end point):

      SLT     SRT     SLB     SRB
LT    PTN_R   PTN_1   PTN_2   PTN_R
RT    PTN_1   PTN_R   PTN_R   PTN_2
LB    PTN_2   PTN_R   PTN_R   PTN_1
RB    PTN_R   PTN_2   PTN_1   PTN_R
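The 16-entry library above is a plain lookup table, so it can be sketched as a nested dictionary. The names `X_PATTERN`, `lookup_pattern`, and `intersect` are hypothetical. The intersection rule is an assumption: it treats PTN_R as imposing no constraint, which is consistent with the worked example later in the deck (PTN_1 ∩ PTN_R = PTN_1) but is not spelled out in the slides.

```python
# Rows: zone of the start point. Columns: sub-zone of the end point.
X_PATTERN = {
    "LT": {"SLT": "PTN_R", "SRT": "PTN_1", "SLB": "PTN_2", "SRB": "PTN_R"},
    "RT": {"SLT": "PTN_1", "SRT": "PTN_R", "SLB": "PTN_R", "SRB": "PTN_2"},
    "LB": {"SLT": "PTN_2", "SRT": "PTN_R", "SLB": "PTN_R", "SRB": "PTN_1"},
    "RB": {"SLT": "PTN_R", "SRT": "PTN_2", "SLB": "PTN_1", "SRB": "PTN_R"},
}


def lookup_pattern(start_zone, end_sub_zone):
    """Return the library pattern for one ordered pair of zones."""
    return X_PATTERN[start_zone][end_sub_zone]


def intersect(pattern_a, pattern_b):
    """Combine the patterns seen from both endpoints.

    Assumption: PTN_R is compatible with either concrete pattern.
    Returns None when the two concrete patterns disagree.
    """
    if pattern_a == "PTN_R":
        return pattern_b
    if pattern_b == "PTN_R":
        return pattern_a
    return pattern_a if pattern_a == pattern_b else None


print(lookup_pattern("LT", "SRT"))
print(intersect("PTN_1", "PTN_R"))
```

A table lookup keeps pattern selection at constant cost per pair, which fits the goal of simplifying the DME merging step.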

26

Complete routing result: r1 @ 0.13 μm

X-Pattern

Main idea:
› The clock source is located near the center of the routing area.
› Centralize all the routing wires.

27

X-Pattern (cont’d)

Assume that s1 and s2 are paired.
› Step 1. Tile the routing area into LT, RT, LB, and RB. s1 is located in LT.
› Step 2. Tile the routing area of s1 into SLT, SRT, SLB, and SRB. s2 is located in SRT.
› Step 3. Define the X-pattern for the 4 sub-zones.

[Figure: sinks s1 to s8 in the tiled routing area, and the PTN_1 and PTN_2 connections from s1 to s2 for each of the four sub-zones.]

28

X-Pattern (cont’d)

[Figure: the same tiling example as on the previous slide, shown next to the completed X-pattern lookup table (zone location of start point LT, RT, LB, RB against zone location of end point SLT, SRT, SLB, SRB).]

29

Outline


30

Proposed Algorithm

Algorithm PMXF
Input: a set S of sinks and 16 kinds of X-patterns for a pair of points
Output: a ZST based on X-architecture with zero skew and minimal delay
begin
  while (|S| > 1)
  {
    (s1, s2) = DPPG(S);                       // Determine a pair of points using GMA
    Pattern = CPXP(s1, s2) ∩ CPXP(s2, s1);    // Choose proper X-pattern
    Pt = DCTP(s1, s2, x);                     // Find tapping point Pt of s1 and s2
    if (x < 0) WireSizing(s1, s2);            // Adjust w2
    if (x > 1) WireSizing(s2, s1);            // Adjust w1
    DME-X(s1, s2, Pt, Pattern);               // Construct the clock tree
    X-Flip(s1, s2);                           // Reduce wire length
    Insert(S, Pt);                            // Insert Pt to S
  }
end

PMXF (Pattern-Matching based on X-clock routing with X-Flip) algorithm

31

DPPG Procedure

Determine Pair of Points in GMA
› GMA is a bottom-up algorithm [Kahng DAC91].
› Focus on path-length balancing.

[Figure: DPPG applied repeatedly to pair the points X1 to X15 into a tree.]

Time complexity: O(log n)
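To make the pairing step concrete, here is a minimal sketch that returns the geometrically closest pair of points. This is only a stand-in for DPPG: the slides do not give the exact matching criterion used with GMA, the Euclidean metric is an assumption, and this brute-force search is quadratic, so it does not reflect the complexity stated on the slide. The function name is hypothetical.

```python
import math
from itertools import combinations


def closest_pair(points):
    """Return the two points with the smallest Euclidean distance.

    Brute force over all pairs; ties are resolved in input order.
    """
    return min(combinations(points, 2), key=lambda pair: math.dist(*pair))


sinks = [(0, 0), (10, 0), (1, 1), (5, 5)]
print(closest_pair(sinks))
```

In a bottom-up flow the returned pair would be merged into one tapping point, which then replaces the two points in the working set.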

32

CPXP Procedure

Choose Proper X-Pattern

› Ex. CPXP(X1, X2)
› Step 1. Tile the routing area. X1 is located in LT.
› Step 2. Tile the routing area of the start point, X1. X2 is located in SRT.
› Step 3. Look up the given X-pattern table.

CPXP(X1, X2) = PTN_1
CPXP(X2, X1) = PTN_R
CPXP(X1, X2) ∩ CPXP(X2, X1) = PTN_1
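Steps 1 and 2 above amount to two quadrant tests. The sketch below shows one way to compute the zone labels, assuming the routing area is tiled around its center, the sub-zones are tiled around the start point, the y-axis points up, and points on a boundary fall to the right or top. These conventions and the function names are assumptions, not taken from the slides.

```python
def zone(point, center):
    """Quadrant of `point` relative to `center`: LT, RT, LB, or RB."""
    horizontal = "L" if point[0] < center[0] else "R"
    vertical = "T" if point[1] >= center[1] else "B"
    return horizontal + vertical


def sub_zone(start, end):
    """Sub-zone (SLT, SRT, SLB, SRB) of `end` in the tiling around `start`."""
    return "S" + zone(end, start)


center = (5, 5)              # hypothetical center of the routing area
x1, x2 = (1, 9), (3, 10)     # hypothetical coordinates for the example
print(zone(x1, center), sub_zone(x1, x2))
```

The two labels are then the row and column used in step 3 to read the pattern from the X-pattern table.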

[Figure: CPXP applied to each pair of points among X1 to X15, with the routing area tiled into zones LT, RT, LB, RB and sub-zones SLT, SRT, SLB, SRB.]


33

DCTP Procedure

Determine Coordinate of Tapping Point
› The tapping point, Pt, is determined to achieve zero skew [Tsay ICCAD91].
› Zero-skew condition ratio: x.
› If 0 ≤ x ≤ 1, the tapping point is located on the wire.
› If x < 0 or x > 1, a snaking wire is needed.
› Use binary search to determine the coordinate [Wu IEICE07].

Time complexity: O(n)
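For intuition about the ratio x, the sketch below uses the closed-form zero-skew condition for the plain Elmore model with per-unit-length wire resistance r and capacitance c. This is a simplification: the deck uses the fitted Elmore model and a binary search, which are not reproduced here. All names are hypothetical; t1, t2 are the subtree delays and c1, c2 the subtree capacitances.

```python
def tapping_ratio(t1, t2, c1, c2, length, r, c):
    """Fraction x of the wire (measured from subtree 1) where skew is zero."""
    wire_r = r * length
    wire_c = c * length
    return ((t2 - t1) + wire_r * (c2 + wire_c / 2)) / (wire_r * (wire_c + c1 + c2))


def branch_delays(x, t1, t2, c1, c2, length, r, c):
    """Elmore delay from the tapping point into each subtree."""
    l1, l2 = x * length, (1 - x) * length
    d1 = t1 + r * l1 * (c * l1 / 2 + c1)
    d2 = t2 + r * l2 * (c * l2 / 2 + c2)
    return d1, d2


x = tapping_ratio(t1=0.0, t2=1.0, c1=1.0, c2=3.0, length=2.0, r=1.0, c=1.0)
print(x, branch_delays(x, 0.0, 1.0, 1.0, 3.0, 2.0, 1.0, 1.0))
```

When the returned x falls outside [0, 1], no point on the direct wire balances the two branches, which is the case that the snaking or wire-sizing step handles.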

34

Wire Sizing

Wire snaking is one of the common methods for constructing a ZST. Benefits of adopting wire sizing instead [El-Moursy GLSVLSI03]:

› Releases routing resources
› But needs extra power due to wider wires

[Figure: snaking wire vs. sized wire]

35

Wire Sizing (cont’d)

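Consider the zero skew condition, x < 0.

$$d_x(s_1, P_t) = \mathrm{FED}(l_1, w_1, C_{L1}) = \frac{r\,l_1}{w_1}\left[\frac{l_1\,(D c_a w_1 + E c_f)}{2} + F C_{L1}\right]$$

$$d_x(s_2, P_t) = d_x(s_1, P_t) \;\Rightarrow\; w_2 = \frac{l_2\left(\dfrac{E c_f l_2}{2} + F C_{L2}\right)}{\dfrac{D c_a\,(l_1^2 - l_2^2)}{2} + \dfrac{l_1}{w_1}\left(\dfrac{E c_f l_1}{2} + F C_{L1}\right)}$$

Time complexity: O(n)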
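The wire-sizing idea can be sketched numerically: pick the width w2 so that the fitted Elmore delay of branch 2 equals that of branch 1. The sketch assumes the FED form FED(l, w, C_L) = (r l / w) [l (D c_a w + E c_f) / 2 + F C_L], with D, E, F taken from the parameter table in the Experimental Results slide. Both the functional form and the closed-form solution for w2 are reconstructions, not verified against the original slide, and the numeric inputs are hypothetical.

```python
import math

LN2 = math.log(2)
# Fitting coefficients as listed in the experimental parameter table.
D, E, F = 1.12673 * LN2, 1.10463 * LN2, 1.04836 * LN2


def fed(length, width, c_load, r, ca, cf):
    """Fitted Elmore delay of one wire driving c_load (assumed form)."""
    return (r * length / width) * (length * (D * ca * width + E * cf) / 2 + F * c_load)


def size_w2(l1, w1, cl1, l2, cl2, ca, cf):
    """Width of branch 2 that makes its FED equal to that of branch 1."""
    numerator = l2 * (E * cf * l2 / 2 + F * cl2)
    denominator = D * ca * (l1**2 - l2**2) / 2 + (l1 / w1) * (E * cf * l1 / 2 + F * cl1)
    return numerator / denominator


r, ca, cf = 0.623, 0.00598, 0.043          # values from the parameter table
w2 = size_w2(l1=1000.0, w1=1.0, cl1=50.0, l2=600.0, cl2=20.0, ca=ca, cf=cf)
print(w2, fed(1000.0, 1.0, 50.0, r, ca, cf), fed(600.0, w2, 20.0, r, ca, cf))
```

Because w2 has a closed form under this model, balancing a pair by sizing costs constant time per merge, in contrast to searching for a snaking length.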

36

DME-X Procedure

Traditional DME based on X-architecture.

Bottom-up phase
› Create TOR.
› Merge.

[Figure: points X1 to X8 in the bottom-up phase.]

37

DME-X (cont’d)

Traditional DME based on X-architecture.

Bottom-up phase
› Create TOR.
› Merge.

Top-down phase
› Determine the points' locations.
› Connect all the nodes.

[Figure: points X1 to X15 after the top-down phase.]

38

DME-X (cont’d)

Our DME-X method
› Integrate the bottom-up and top-down phases.
› Construct the parallelogram.
› DCTP(X4, X6)
› CPXP(X4, X6) ∩ CPXP(X6, X4)
› Tip! Run CPXP first, then DCTP, to save running time.

[Figure: points X1 to X15, with X9 replaced by X9’ after DPPG.]


39

X-Flip Procedure

Exchange the X-pattern based on the predefined patterns.

[Figure: the PTN_1 and PTN_2 connections between s1 and s2 are exchanged.]

Complete routing result: 08-5 @ 0.13 μm
› Delay = 4454.614 ps
› Cost = 38219.374 μm
› Power = 0.000531 W

40

X-Flip (cont’d)

Check the length of the (i-1)th level when constructing the ith level.

Complete routing result: 08-5 @ 0.13 μm with X-Flip
› Delay = 4139.209 ps, saving 7%
› Cost = 36334.753 μm, saving 4.9%
› Power = 0.000515 W, saving 3%


41

Time Complexity Analysis

Algorithm PMXF
Input: a set S of sinks and 16 kinds of X-patterns for a pair of points
Output: a ZST based on X-architecture with zero skew and minimal delay
begin
  while (|S| > 1)
  {
    (s1, s2) = DPPG(S);                       // Determine a pair of points using GMA
    Pattern = CPXP(s1, s2) ∩ CPXP(s2, s1);    // Choose proper X-pattern
    Pt = DCTP(s1, s2, x);                     // Find tapping point Pt of s1 and s2
    if (x < 0) WireSizing(s1, s2);            // Adjust w2
    if (x > 1) WireSizing(s2, s1);            // Adjust w1
    DME-X(s1, s2, Pt, Pattern);               // Construct the clock tree
    X-Flip(s1, s2);                           // Reduce wire length
    Insert(S, Pt);                            // Insert Pt to S
  }
end

Time complexity: O(n log n)


42

Outline


43

Experimental Results

› Platform: WinXP-SP2 on P4-M 1.7G with 1G memory
› Compiler: Borland C++ Builder 6.0
› IBM benchmarks, r1-r5, for testing our algorithm PMXF
› Our PMXF is compared with
  » DME-4 [Shen ISCAS06], based on the fitted Elmore delay model
  » NVM [Wang VLSI-DAT07], based on the Elmore delay model
› 0.13 μm fabrication parameters are used.

DME-4 (fitted Elmore)                                      NVM (Elmore)
r    0.623 Ω/μm       D   1.12673 ln2    Fclk  100 MHz     r   0.623 Ω/μm
ca   0.00598 fF/μm    E   1.10463 ln2    Vdd   1.2 V       c   0.118 fF/μm
cf   0.043 fF/μm      F   1.04836 ln2

44

Our GUI

45

Our GUI (cont’d)

46

Our Results based on FED Model

Comparison of our algorithm without X-Flip (PMX) and with X-Flip (PMXF) in terms of delay, wire length, power consumption, total via, and runtime for the FED model.

Delay (μs)
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     0.47415    0.310858   0.656
r2         598     1.130498   0.841717   0.744
r3         862     1.632144   1.790971   1.097
r4         1903    4.639215   3.989911   0.860
r5         3101    8.987384   7.881827   0.877
Average    -       -          -          0.847

Wire length (μm)
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     1406401    1383347    0.983
r2         598     3000575    2863408    0.954
r3         862     3750372    3651790    0.973
r4         1903    7593864    7221328    0.950
r5         3101    11322668   10855445   0.958
Average    -       -          -          0.964

Power (W)
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     0.074957   0.076774   1.024
r2         598     0.194351   0.19197    0.987
r3         862     0.263825   0.25889    0.981
r4         1903    0.617959   0.599185   0.969
r5         3101    1.023591   0.998684   0.975
Average    -       -          -          0.987

Total via
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     1248       1215       0.973
r2         598     2744       2816       1.026
r3         862     3993       4019       1.006
r4         1903    9170       9160       0.998
r5         3101    14624      14528      0.993
Average    -       -          -          0.999

Runtime (s)
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     6.72       6.859      1.02
r2         598     28.12      31.535     1.121
r3         862     65.163     70.261     1.078
r4         1903    754.024    897.16     1.189
r5         3101    1840.375   2309.672   1.255
Average    -       -          -          1.126

› Improves delay by 15.3%
› Improves wire length by 3.6% and power by 1.3%
› Improves total via by 0.1%, but needs 12.6% more runtime

47

Our Results Based on ED Model

Comparison of our algorithm without X-Flip (PMX) and with X-Flip (PMXF) in terms of delay, wire length, power consumption, total via, and runtime for the ED model.

Delay (μs)
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     0.165138   0.137993   0.836
r2         598     0.455854   0.320785   0.704
r3         862     0.526273   0.498202   0.947
r4         1903    1.822653   1.614070   0.886
r5         3101    2.577663   2.095517   0.813
Average    -       -          -          0.837

Wire length (μm)
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     1517071    1364700    0.900
r2         598     2917878    2788433    0.956
r3         862     3757000    3696636    0.984
r4         1903    7513128    7363705    0.980
r5         3101    11246479   10854213   0.965
Average    -       -          -          0.957

Power (W)
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     0.165665   0.163641   0.988
r2         598     0.377402   0.374030   0.991
r3         862     0.525380   0.514669   0.980
r4         1903    1.208581   1.185089   0.981
r5         3101    1.978120   1.952519   0.987
Average    -       -          -          0.985

Total via
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     1230       1229       0.999
r2         598     2846       2867       1.007
r3         862     4016       3993       0.994
r4         1903    9220       8912       0.967
r5         3101    14718      14546      0.988
Average    -       -          -          0.991

Runtime (s)
Benchmark  #Sinks  PMX        PMXF       PMXF/PMX
r1         267     6.930      7.491      1.081
r2         598     25.577     30.144     1.179
r3         862     60.213     66.957     1.112
r4         1903    732.046    790.046    1.079
r5         3101    1597.062   1689.281   1.058
Average    -       -          -          1.102

› Improves delay by 16.3%
› Improves wire length by 4.3% and power by 1.5%
› Improves total via by 0.9%, but needs 10.2% more runtime

48

Clock Tree Construction of r5 Based on PMXF

› #sinks: 3101
› Delay: 7.881827 μs
› Skew: 0
› #vias: 14528
› Power: 0.998684 W
› Runtime: 2309.672 s

49

Our Results Compared with DME-4

[8] W. Shen, Y. Cai, J. Hu, X. Hong, and B. Lu, “High Performance Clock Routing in X-architecture,” IEEE International Symposium On Circuits and Systems, 2006, pp. 2081-2084.

Compare our PMXF algorithm with DME-4[8] in terms of delay, wire length, and power consumption for FED model.

50


Delay (μs)
Benchmark  #sinks  DME-4[8]   PMXF       PMXF/DME-4
r1         267     0.471340   0.310858   0.659
r2         598     1.145970   0.841717   0.734
r3         862     1.664930   1.790971   1.075
r4         1903    4.631840   3.989911   0.861
r5         3101    9.053950   7.881827   0.871
Average    -       -          -          0.840

Improves delay by 16%

The comparison of our algorithm and DME-4[8] in delay
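The "Average" row in these comparison tables is the mean of the per-benchmark ratios. The sketch below recomputes it from the delay values listed above. Only the variable names are invented. The slide's per-row ratios appear to be truncated or rounded, so individual values may differ in the last digit.

```python
# (DME-4 delay, PMXF delay) per benchmark, copied from the delay table.
delays = {
    "r1": (0.471340, 0.310858),
    "r2": (1.145970, 0.841717),
    "r3": (1.664930, 1.790971),
    "r4": (4.631840, 3.989911),
    "r5": (9.053950, 7.881827),
}

# Ratio below 1 means PMXF has the smaller delay on that benchmark.
ratios = {name: pmxf / dme4 for name, (dme4, pmxf) in delays.items()}
average_ratio = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
print(f"average: {average_ratio:.3f}")
```

Note that r3 is the one benchmark where the ratio exceeds 1, so the reported 16% improvement is an average and not a per-benchmark guarantee.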

51


Wire length (μm)
Benchmark  #sinks  DME-4[8]   PMXF       PMXF/DME-4
r1         267     1414960    1383347    0.977
r2         598     2863420    2863408    0.999
r3         862     3656580    3651790    0.998
r4         1903    7245500    7221328    0.996
r5         3101    10971100   10855445   0.989
Average    -       -          -          0.992

Improves wire length by 0.8%

The comparison of our algorithm and DME-4[8] in wire length

52


Power (W)
Benchmark  #sinks  DME-4[8]   PMXF       PMXF/DME-4
r1         267     0.074594   0.076785   1.029
r2         598     0.180590   0.174153   0.964
r3         862     0.254845   0.258602   1.015
r4         1903    0.589042   0.583533   0.991
r5         3101    0.981078   0.909697   0.927
Average    -       -          -          0.985

Improves power by 1.5%

The comparison of our algorithm and DME-4[8] in power

53

Our Results Compared with NVM

[9] C. H. Wang and W. K. Mak, “λ-Geometry clock tree construction with wire length and via minimization,” IEEE International Symposium on VLSI-DAT, 2007, pp. 124-127.

Compare our PMXF algorithm with NVM[9] in terms of via and wire length for ED model.

54


                   Node via          Total via
Benchmark  #sinks  NVM[9]   PMXF     NVM[9]   PMXF     PMXF/NVM
r1         267     832      720      1486     1229     0.827
r2         598     1859     1658     3401     2867     0.843
r3         862     2689     2335     4921     3993     0.811
r4         1903    6046     5200     10769    8912     0.828
r5         3101    9818     8504     17681    14546    0.823
Average    -       -        -        -        -        0.826

Improves total via by 17.4%

The comparison of our algorithm and NVM[9] in node/ total via

55

Wire length (μm)
Benchmark  #sinks  NVM[9]     PMXF       PMXF/NVM
r1         267     1200300    1364700    1.137
r2         598     2354000    2788433    1.185
r3         862     3074900    3696636    1.202
r4         1903    6145000    7363705    1.198
r5         3101    9152300    10854213   1.186
Average    -       -          -          1.181

Worsens wire length by 18.1%


The comparison of our algorithm and NVM[9] in wire length

56

Explanation for Wire Length

Why did we perform worse than NVM in wire length?

› The wire length is determined by different topologies.

Merging rules used in NVM
› N = |S|/c, where c is a constant.
› Step 1. Sort the corresponding edges, eim, of sink i, si, where si ∈ S and eim ∈ E.
› Step 2. Get the first N edges in E.
› Step 3. Merge the N elements in S.
› Step 4. Remove the merged elements from S and add the new merged one.

57

Conclusion

X-architecture has been proven more effective than Manhattan architecture in constructing a ZST.

We defined 16 X-routing patterns to simplify the merging procedures in constructing an X-based ZST.

X-flip can shorten wire length and minimize clock delay.

Wire sizing removes snaking wires and saves routing resources.

Our algorithm performs well in clock delay, wire length, power, and via cost.

58

Future Works

Insert buffers to get higher performance

Consider the inductance effects in the delay model

Consider DFM problems
› Antenna effect
› Optical correction
› Redundant via insertion
› CMP variation

Planar X-routing with fewer metal layers

59

Thank you for your attendance!
