Chapter 26 Buffer Insertion Basics
Jiang Hu∗, Zhuo Li†and Shiyan Hu‡
1 Motivation
When the VLSI technology scales, gate delay and wire delay change in opposite directions. Smaller devices imply less gate switching delay. In contrast, thinner wire size leads to increased wire resistance and greater signal propagation delay along wires. As a result, wire delay has become a dominating factor for VLSI circuit performance. Further, it is becoming a limiting factor to the progress of VLSI technology. This is the well-known interconnect challenge [1-3]. Among the many techniques addressing this challenge [4,5], buffer (or repeater) insertion is so effective that it has become indispensable for timing closure in submicron technology and beyond. Buffers can reduce wire delay by restoring signal strength, in particular for long wires. Moreover, buffers can be applied to shield capacitive load from timing-critical paths such that the interconnect delay along critical paths is reduced.
As the ratio of wire delay to gate delay increases from one technology to the next, more and more buffers are required to achieve performance goals. The buffer scaling is studied by Intel and the results are reported in [6]. One metric that reveals the scaling is the critical buffer length - the minimum distance beyond which inserting an optimally placed and sized buffer makes the interconnect delay less than that of the corresponding unbuffered wire. When wire delay increases due to the technology scaling, the critical buffer length becomes shorter, i.e., the distance that a buffer can comfortably drive shrinks. According to [6], the critical buffer length decreases by 68% when the VLSI technology migrates from 90nm to 45nm (i.e., over two generations). Note that the critical buffer length scaling significantly outpaces the VLSI technology scaling, which is roughly 0.5x for every two generations. The percentage of block-level nets requiring buffers grows from 5.8% in 90nm technology to 19.6% in 45nm technology [6]. Perhaps the most alarming result is the scaling of buffer count [6], which predicts that 35% of cells will be buffers in 45nm technology as opposed to only 6% in 90nm technology.

*Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843. †IBM Austin Research Lab, Austin, TX 78758. ‡Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843.
The dramatic buffer scaling undoubtedly generates a large and profound impact on VLSI circuit design. With millions of buffers required per chip, almost nobody can afford to neglect the importance of buffer insertion, compared to a decade ago when only a few thousand buffers were needed for a chip [7]. Due to this importance, buffer insertion algorithms and methodologies need to be deeply studied in various aspects. First, a buffer insertion algorithm should deliver solutions of high quality since interconnect and circuit performance largely depend on the way that buffers are placed. Second, a buffer insertion algorithm needs to be sufficiently fast so that millions of nets can be optimized in reasonable time. Third, accurate delay models are necessary to ensure that buffer insertion solutions are reliable. Fourth, buffer insertion techniques are expected to simultaneously handle multiple objectives, such as timing, power and signal integrity, and their tradeoffs. Last but not least, buffer insertion should interact with other layout steps, such as placement and routing, as the sheer number of buffers has already altered the landscape of circuit layout design. Many of these issues will be discussed in subsequent sections and other chapters.
2 Optimization of Two-Pin Nets
For buffer insertion, perhaps the simplest case is a two-pin net, which is a wire segment with a driver (source) at one end and a sink at the other end. The simplicity allows closed form solutions to buffer insertion in two-pin nets.
If the delay of a two-pin net is to be minimized by using a single buffer type b, one needs to decide the number of buffers k and the spacing between the buffers, the source and the sink. First, let us look at a very simple case in order to attain an intuitive understanding of the problem. In this case, the length of the two-pin net is l, and the wire resistance and capacitance per unit length are r and c, respectively. The number of buffers k has been given and is fixed. The driver resistance is the same as the buffer output resistance Rb. The load capacitance of the sink is identical to the buffer input capacitance Cb. The buffer has an intrinsic delay of tb. The k buffers separate the net into k + 1 segments, with lengths ~l = (l_0, l_1, ..., l_k)^T (see Figure 1). Then, the Elmore delay of this net can be expressed as:
t(~l) = Σ_{i=0}^{k} (α l_i² + β l_i + γ)    (1)

where α = rc/2, β = Rb·c + r·Cb and γ = Rb·Cb + tb. A formal problem formulation is

Minimize    t(~l)    (2)
Subject to  g(~l) = l − Σ_{i=0}^{k} l_i = 0    (3)
According to the Kuhn-Tucker condition [8], the following equation is a necessary condition for the optimal solution:

~∇t(~l) + λ ~∇g(~l) = 0    (4)

where λ is the Lagrangian multiplier. From the above condition, it can be easily derived that

l_i = (λ − β) / (2α),  i = 0, 1, ..., k    (5)

Since α, β and λ are all constants, it can be seen that the buffers need to be equally spaced in order to minimize the delay. This is an important conclusion that can be treated as a rule of thumb. The value of the Lagrangian multiplier λ can be found by plugging (5) into (3).
Figure 1: Buffer insertion in a two-pin net (the k buffers divide the wire of length l into segments l_0, l_1, ..., l_k between the driver and the sink).
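The equal-spacing conclusion is easy to check numerically. The sketch below evaluates the delay of Eq. (1) for the uniform split and for random splits of the same wire; the parameter values (r, c, Rb, Cb, tb) are made up for illustration and are not from the text.

```python
import random

def two_pin_delay(segs, r, c, Rb, Cb, tb):
    """Elmore delay of a two-pin net cut into len(segs) segments by buffers.

    Follows Eq. (1) of the simplified case: driver resistance equals Rb
    and sink capacitance equals Cb.
    """
    alpha = 0.5 * r * c
    beta = Rb * c + r * Cb
    gamma = Rb * Cb + tb
    return sum(alpha * li * li + beta * li + gamma for li in segs)

# Illustrative numbers: a 4000-unit net with k = 3 buffers -> 4 segments.
r, c, Rb, Cb, tb = 0.1, 0.2e-3, 100.0, 5e-3, 20.0
l, k = 4000.0, 3
equal = [l / (k + 1)] * (k + 1)
best = two_pin_delay(equal, r, c, Rb, Cb, tb)

# Any other split of the same total length is no better.
random.seed(1)
for _ in range(1000):
    cuts = sorted(random.uniform(0, l) for _ in range(k))
    segs = [b - a for a, b in zip([0.0] + cuts, cuts + [l])]
    assert two_pin_delay(segs, r, c, Rb, Cb, tb) >= best
print("equal spacing is optimal over 1000 random splits")
```

Since t(~l) is convex and the constraint is linear, the equal-length split is the global minimum, which the random search cannot beat.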
In more general cases, the driver resistance Rd may be different from the buffer output resistance, and likewise the sink capacitance CL may differ from Cb. For such cases, the optimum number of buffers minimizing the delay is given by [9]:

k = ⌊ −1/2 + √( 1/4 + (rcl + r(Cb − CL) − c(Rb − Rd))² / (2rc(Rb·Cb + tb)) ) ⌋    (6)
The length of each segment can be obtained through [9]:

l_0 = (1/(k+1)) ( l + k(Rb − Rd)/r + (CL − Cb)/c )    (7)
l_1 = ... = l_{k-1} = (1/(k+1)) ( l − (Rb − Rd)/r + (CL − Cb)/c )
l_k = (1/(k+1)) ( l − (Rb − Rd)/r − k(CL − Cb)/c )
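Equations (6) and (7) translate into a short routine. This is a sketch under the reading that Eq. (6) rounds the continuous optimum to an integer buffer count; all parameter values below are illustrative, and only the fact that the segments partition the wire is checked.

```python
import math

def optimal_buffering(l, r, c, Rb, Cb, tb, Rd, CL):
    """Optimal buffer count (Eq. 6) and segment lengths (Eq. 7)."""
    num = (r * c * l + r * (Cb - CL) - c * (Rb - Rd)) ** 2
    k = math.floor(-0.5 + math.sqrt(0.25 + num / (2 * r * c * (Rb * Cb + tb))))
    if k == 0:
        return 0, [l]          # no buffer pays off; the wire stays whole
    l0 = (l + k * (Rb - Rd) / r + (CL - Cb) / c) / (k + 1)
    lmid = (l - (Rb - Rd) / r + (CL - Cb) / c) / (k + 1)
    lk = (l - (Rb - Rd) / r - k * (CL - Cb) / c) / (k + 1)
    return k, [l0] + [lmid] * (k - 1) + [lk]

k, segs = optimal_buffering(l=10000.0, r=0.1, c=0.2e-3, Rb=100.0,
                            Cb=5e-3, tb=20.0, Rd=150.0, CL=8e-3)
assert abs(sum(segs) - 10000.0) < 1e-6   # segments partition the wire
```

Note that the first and last segments differ from the middle ones exactly as Eq. (7) prescribes, to compensate for the mismatched driver resistance and sink load; all middle segments are equal, recovering the equal-spacing rule when Rd = Rb and CL = Cb.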
A closed form solution to simultaneous buffer insertion/sizing and wire sizing is reported in [10]. Figure 2 shows an example of this simultaneous optimization. The wire is segmented into m pieces. The length l_i and width h_i of each wire piece i are the variables to be optimized. There are k buffers inserted between these pieces. The size b_i of each buffer i is also a decision variable. A buffer location is indicated by its surrounding wire pieces. For example, if the set of wire pieces between buffers i − 1 and i is P_{i-1}, the distance between the two buffers is equal to Σ_{j∈P_{i-1}} l_j. There are two important conclusions [10] for the optimal solution that minimizes the delay. First, all wire pieces have the same length, i.e., l_i = l/m, i = 1, 2, ..., m. Second, for the wire pieces P_{i-1} = {p_{i-1,1}, p_{i-1,2}, ..., p_{i-1,m_{i-1}}} between buffers i − 1 and i, their widths satisfy h_{i-1,1} > h_{i-1,2} > ... > h_{i-1,m_{i-1}} and form a geometric progression.
Figure 2: An example of simultaneous buffer insertion/sizing and wire sizing (wire pieces with lengths l_1, ..., l_m and widths h_1, ..., h_m, and buffers b_1, ..., b_k inserted between them).
3 Van Ginneken's algorithm

For the general case of signal nets, which may have multiple sinks, van Ginneken's algorithm [11] is perhaps the first systematic approach to buffer insertion. For a fixed signal routing tree and given candidate buffer locations, van Ginneken's algorithm can find the optimal buffering solution that maximizes timing slack according to the Elmore delay model. If there are n candidate buffer locations, its computation complexity is O(n²). Based on van Ginneken's algorithm, numerous extensions have been made, such as handling of multiple buffer types, tradeoffs with power and cost, addressing slew rate and crosstalk noise, using accurate delay models, and speedup techniques. These extensions will be covered in subsequent sections.

At a high level, van Ginneken's algorithm [11] proceeds bottom-up from the leaf nodes toward the driver along a given routing tree. A set of candidate solutions is kept updated during the process, where three operations, adding wire, inserting buffers and branch merging, may be performed. Meanwhile, the inferior solutions are pruned to accelerate the algorithm. After the set of candidate solutions is propagated to the source, the solution with the maximum required arrival time is selected as the final solution. For a routing tree with n buffer positions, the algorithm computes the optimal buffering solution in O(n²) time.
A net is given as a binary routing tree T = (V, E), where V = {s0} ∪ Vs ∪ Vn, and E ⊆ V × V. Vertex s0 is the source vertex and also the root of T, Vs is the set of sink vertices, and Vn is the set of internal vertices. In the existing literature, s0 is also referred to as the driver. Denote by T(v) the subtree of T rooted at v. Each sink vertex s ∈ Vs is associated with a sink capacitance C(s) and a required arrival time (RAT). Each edge e ∈ E is associated with lumped resistance R(e) and capacitance C(e). A buffer library B containing all the possible buffer types which can be assigned to a buffer position is also given. In this section, B contains only one buffer type. Delay estimation is obtained using the Elmore delay model, which is described in Chapter 3. A buffer assignment γ is a mapping γ : Vn → B ∪ {b̄} where b̄ denotes that no buffer is inserted. The timing buffering problem is defined as follows.

Timing Driven Buffer Insertion Problem: Given a binary routing tree T = (V, E), possible buffer positions, and a buffer library B, compute a buffer assignment γ such that the RAT at the driver is maximized.
3.1 Concept of Candidate Solution

A buffer assignment γ is also called a candidate solution for the timing buffering problem. A partial solution, denoted by γ_v, refers to an incomplete solution in which the buffer assignment in T(v) has been determined.

The Elmore delay from v to any sink s in T(v) under γ_v is computed by

D(s, γ_v) = Σ_{e=(v_i, v_j)} (D(v_i) + D(e)),

where the sum is taken over all edges along the path from v to s. The slack of vertex v under γ_v is defined as

Q(γ_v) = min_{s∈T(v)} { RAT(s) − D(s, γ_v) }.

At any vertex v, the effect of a partial solution γ_v on its upstream part is characterized by a (Q(γ_v), C(γ_v)) pair, where Q is the slack at v under γ_v and C is the downstream capacitance seen at v under γ_v.
Figure 3: Operations in van Ginneken's algorithm: (a) wire insertion, (b) buffer insertion, (c) branch merging.
3.2 Generating Candidate Solutions

Van Ginneken's algorithm proceeds bottom-up from the leaf nodes toward the driver along T. A set of candidate solutions, denoted by Γ, is kept updated during this process. There are three operations in solution propagation, namely, wire insertion, buffer insertion and branch merging (see Figure 3). We describe them in turn.

3.2.1 Wire insertion

Suppose that a partial solution γ_v at position v propagates to an upstream position u and there is no branching point in between. If no buffer is placed at u, then only wire delay needs to be considered. Therefore, the new solution γ_u can be computed as

Q(γ_u) = Q(γ_v) − D(e),
C(γ_u) = C(γ_v) + C(e),    (8)

where e = (u, v) and D(e) = R(e)(C(e)/2 + C(γ_v)).
3.2.2 Buffer insertion

Suppose that we add a buffer b at u. γ_u is then updated to γ′_u where

Q(γ′_u) = Q(γ_u) − (R(b) · C(γ_u) + K(b)),
C(γ′_u) = C(b).    (9)
3.2.3 Branch merging

When two branches T_l and T_r meet at a branching point v, Γ_l and Γ_r, which correspond to T_l and T_r, respectively, are to be merged. The merging process is performed as follows. For each solution γ_l ∈ Γ_l and each solution γ_r ∈ Γ_r, generate a new solution γ′ according to:

C(γ′) = C(γ_l) + C(γ_r),
Q(γ′) = min{ Q(γ_l), Q(γ_r) }.    (10)

The smaller Q is picked since the worst-case circuit performance needs to be considered.
3.3 Inferiority and Pruning Identification

Simply propagating all solutions by the above three operations makes the solution set grow exponentially in the number of buffer positions processed. An effective and efficient pruning technique is necessary to reduce the size of the solution set. This motivates an important concept, the inferior solution, in van Ginneken's algorithm. For any two partial solutions γ1, γ2 at the same vertex v, γ2 is inferior to γ1 if C(γ1) ≤ C(γ2) and Q(γ1) ≥ Q(γ2). Whenever a solution becomes inferior, it is pruned from the solution set. Therefore, only solutions that excel in at least one aspect of downstream capacitance and slack can survive.

For an efficient pruning implementation, and thus an efficient buffering algorithm, a sorted list is used to maintain the solution set. The solution set Γ is sorted in increasing order of C, and thus Q is also in increasing order if Γ does not contain any inferior solutions.

In a straightforward implementation, adding a wire does not change the number of candidate solutions, and inserting a buffer introduces only one new candidate solution. More effort is needed to merge two branches T_l and T_r at v. For each partial solution in Γ_l, find the first solution with a larger Q value in Γ_r. If such a solution does not exist, the last solution in Γ_r is taken. Since Γ_l and Γ_r are sorted, we only need to traverse them once. Partial solutions in Γ_r are treated similarly. It is easy to see that after merging, the number of solutions is at most |Γ_l| + |Γ_r|. As such, given n buffer positions, at most n solutions can be generated at any time. Consequently, the pruning procedure at any vertex in T runs in O(n) time.
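The three operations and the pruning rule can be sketched as follows, representing each candidate solution as a (Q, C) pair. This is a simplified illustration, not the text's implementation: buffer assignments are omitted, and the O(n) merge of sorted lists is replaced by brute-force pairing for clarity.

```python
def prune(sols):
    """Drop inferior (Q, C) pairs; return the set sorted by increasing C.
    A pair is inferior if another pair has no larger C and no smaller Q."""
    kept, best_q = [], float("-inf")
    for q, c in sorted(sols, key=lambda s: (s[1], -s[0])):
        if q > best_q:               # better slack than every cheaper solution
            kept.append((q, c))
            best_q = q
    return kept

def add_wire(sols, Re, Ce):
    """Eq. (8): propagate every candidate across a wire e = (u, v)."""
    return prune([(q - Re * (Ce / 2 + c), c + Ce) for q, c in sols])

def add_buffer(sols, Rb, Kb, Cb):
    """Eq. (9): optionally insert a buffer at the current node."""
    return prune(sols + [(q - (Rb * c + Kb), Cb) for q, c in sols])

def merge(left, right):
    """Eq. (10): combine branch solution sets at a branching point."""
    return prune([(min(ql, qr), cl + cr) for ql, cl in left for qr, cr in right])
```

Because `prune` keeps only pairs whose slack strictly improves as capacitance grows, the surviving set is exactly the non-dominated frontier described above.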
3.4 Pseudo-code

In van Ginneken's algorithm, a set of candidate solutions is propagated from the sinks to the driver. Along a branch, after a candidate buffer location v is processed, all solutions are propagated to its upstream buffer location u through wire insertion. A buffer is then inserted into each solution to obtain a new solution. Meanwhile, inferior solutions are pruned. At a branching point, the solution sets from all branches are merged by the merging process. In this way, the algorithm proceeds in a bottom-up fashion, and the solution with the maximum required arrival time at the driver is returned. Given n buffer positions in T, van Ginneken's algorithm can compute a buffer assignment with maximum slack at the driver in O(n²) time, since any operation at any node can be performed in O(n) time. Refer to Figure 4 for the pseudo-code of van Ginneken's algorithm.
Algorithm: van Ginneken's algorithm
Input: T: routing tree, B: buffer library
Output: γ which maximizes the slack at the driver
1.  for each sink s, build a solution set {γ_s}, where Q(γ_s) = RAT(s) and C(γ_s) = C(s)
2.  for each branching point/driver v_t in the order given by a postorder traversal of T,
    let T′ be each of the branches T1, T2 of v_t and Γ′ be the solution set corresponding to T′, do
3.    for each wire e in T′, in a bottom-up order, do
4.      for each γ ∈ Γ′, do
5.        C(γ) = C(γ) + C(e)
6.        Q(γ) = Q(γ) − D(e)
7.      prune inferior solutions in Γ′
8.      if the current position allows buffer insertion, then
9.        for each γ ∈ Γ′, generate a new solution γ′
10.       set C(γ′) = C(b)
11.       set Q(γ′) = Q(γ) − R(b) · C(γ) − K(b)
12.       Γ′ = Γ′ ∪ {γ′} and prune inferior solutions
13.   // merge Γ1 and Γ2 into Γ_{v_t}
14.   set Γ_{v_t} = ∅
15.   for each γ1 ∈ Γ1 and γ2 ∈ Γ2, generate a new solution γ′
16.     set C(γ′) = C(γ1) + C(γ2)
17.     set Q(γ′) = min{Q(γ1), Q(γ2)}
18.     Γ_{v_t} = Γ_{v_t} ∪ {γ′} and prune inferior solutions
19. return γ with the largest slack

Figure 4: Van Ginneken's algorithm.
3.5 Example

Let us look at a simple example to illustrate the work flow of van Ginneken's algorithm. Refer to Figure 5. Assume that there are three non-dominated solutions at v3 whose (Q, C) pairs are

(200, 10), (300, 30), (500, 50),

and there are two non-dominated solutions at v2 whose (Q, C) pairs are

(290, 5), (350, 20).
Figure 5: An example for performing van Ginneken's algorithm.
We first propagate them to v1 through wire insertion. Assume that R(v1, v3) = 3 and C(v1, v3) = 2. Solution (200, 10) at v3 becomes (200 − 3 · (2/2 + 10), 10 + 2) = (167, 12) at v1. Similarly, the other two solutions become (207, 32) and (347, 52). Assuming that R(v1, v2) = 2 and C(v1, v2) = 2, the solutions at v2 become (278, 7) and (308, 22) at v1.

We now merge these solutions at v1. Denote by Γ_l the solutions propagated from v3 and by Γ_r the solutions propagated from v2. Before merging, the partial solutions in Γ_l are

(167, 12), (207, 32), (347, 52),

and the partial solutions in Γ_r are

(278, 7), (308, 22).

After branch merging, the new candidate partial solutions whose Q values are dictated by solutions in Γ_l are

(167, 19), (207, 39), (308, 74),

and those dictated by solutions in Γ_r are

(278, 59), (308, 74).

After pruning inferior solutions, the solution set at v1 is

{(167, 19), (207, 39), (278, 59), (308, 74)}.
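The arithmetic of this example can be replayed in a few lines. The brute-force merge-and-prune below is a simplification of the sorted-list procedure of Section 3.3; the node and edge values are exactly those of the example.

```python
def across_wire(q, c, Re, Ce):
    # Eq. (8): wire delay D(e) = Re * (Ce/2 + C(gamma_v))
    return q - Re * (Ce / 2 + c), c + Ce

# Propagate the v3 and v2 solution sets to v1.
gl = [across_wire(q, c, 3, 2) for q, c in [(200, 10), (300, 30), (500, 50)]]
gr = [across_wire(q, c, 2, 2) for q, c in [(290, 5), (350, 20)]]

# Merge by brute-force pairing (Eq. 10), then prune inferior solutions.
merged = [(min(ql, qr), cl + cr) for ql, cl in gl for qr, cr in gr]
pruned, best_q = [], float("-inf")
for q, c in sorted(merged, key=lambda s: (s[1], -s[0])):
    if q > best_q:
        pruned.append((q, c))
        best_q = q
# pruned now holds (167, 19), (207, 39), (278, 59), (308, 74), as in the text
```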
4 Van Ginneken Extensions

4.1 Handling a Library with Multiple Buffers

We extend the standard van Ginneken's algorithm to handle multiple buffer types and buffer cost [12]. The buffer library B now contains various types of buffers. Each buffer b in the buffer library has a cost W(b), which can be measured by area or any other metric, depending on the optimization objective. A function f : Vn → 2^B specifies the types of buffers allowed at each internal vertex in T. The cost of a solution γ, denoted by W(γ), is defined as W(γ) = Σ_{b∈γ} W(b). With the above notations, our new problem can be formulated as follows.

Minimum Cost Timing Constrained Buffer Insertion Problem: Given a binary routing tree T = (V, E), possible buffer positions defined using f, and a buffer library B, compute a minimal-cost buffer assignment γ such that the RAT at the driver is no smaller than a timing constraint α.

In contrast to the single buffer type case, W is introduced into the (Q, C) pair to handle buffer cost, i.e., each solution is now associated with a (Q, C, W) triple. As such, during the process of bottom-up computation, additional effort needs to be made in updating W: if γ′ is generated by
inserting a wire into γ, then W(γ′) = W(γ); if γ′ is generated by inserting a buffer b into γ, then W(γ′) = W(γ) + W(b); if γ′ is generated by merging γ_l with γ_r, then W(γ′) = W(γ_l) + W(γ_r).

The definition of inferior solutions needs to be revised as well. For any two solutions γ1, γ2 at the same node, γ1 dominates γ2 if C(γ1) ≤ C(γ2), W(γ1) ≤ W(γ2) and Q(γ1) ≥ Q(γ2). Whenever a solution becomes dominated, it is pruned from the solution set. Therefore, only solutions that excel in at least one aspect of downstream capacitance, buffer cost and RAT can survive.

With the above modifications, van Ginneken's algorithm can easily adapt to the new problem setup. However, since domination is defined on a (Q, C, W) triple rather than a (Q, C) pair, a more efficient pruning technique is necessary to maintain the efficiency of the algorithm. As such, a range search tree technique is incorporated [12]. This technique will be described in detail in Section 5.2.
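A minimal sketch of the revised dominance check is shown below. It uses a quadratic sweep rather than the range search tree of [12], so it illustrates only the pruning rule, not the efficient data structure.

```python
def prune3(sols):
    """O(n^2) pruning of (Q, C, W) triples: drop any solution dominated
    simultaneously in slack, downstream capacitance, and cost."""
    kept = []
    for i, (q1, c1, w1) in enumerate(sols):
        dominated = any(
            q2 >= q1 and c2 <= c1 and w2 <= w1
            and (q2, c2, w2) != (q1, c1, w1)
            for j, (q2, c2, w2) in enumerate(sols) if j != i
        )
        if not dominated:
            kept.append((q1, c1, w1))
    return kept
```

For example, a triple with less slack, more capacitance, and higher cost than some other triple is removed, while any triple that wins on at least one of the three axes survives.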
4.2 Library with Inverters

So far, all buffers in the buffer library have been non-inverting buffers. The library can also contain inverting buffers, or simply inverters. In terms of buffer cost and delay, inverters can provide cheaper buffer assignments and better delay than non-inverting buffers. Regarding algorithmic design, it is worth noting that introducing inverters into the buffer library brings a polarity issue to the problem, as the output polarity of a buffer is negated after inserting an inverter.
4.3 Polarity Constraints

When the output polarity at the driver is required to be positive or negative, we impose a polarity constraint on the buffering problem. To handle polarity constraints during the bottom-up computation, the algorithm maintains two solution sets, one for positive and one for negative buffer input polarity. After choosing the best solution at the driver, the buffer assignment can then be determined by a top-down traversal. The details of the new algorithm are elaborated as follows.

Denote the two solution sets at vertex v by Γ_v^+ and Γ_v^−, corresponding to positive polarity and negative polarity, respectively. Suppose that an inverter b^− is inserted into a solution γ_v^+ ∈ Γ_v^+; a new solution γ′_v is generated in the same way as before except that it is placed into Γ_v^−. Similarly, the new solution generated by inserting b^− into a solution γ_v^− ∈ Γ_v^− is placed into Γ_v^+. For inserting a non-inverting buffer, the new solution is placed in the same set as its origin.

The other operations are easier to handle. Wire insertion goes the same as before, and the two solution sets are handled separately. Merging is carried out only among the solutions with the same polarity, e.g., the positive-polarity solution set of the left branch is merged with that of the right branch. For the inferiority check and solution pruning, only solutions in the same set can be compared.
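The dual-set bookkeeping can be sketched as below. The (R, K, Cin, inverting) tuple is an illustrative encoding of a buffer type, not the text's notation, and per-set pruning is omitted for brevity.

```python
def insert_buffer_with_polarity(pos_set, neg_set, buf):
    """One buffer-insertion step of the bottom-up sweep with polarity.

    pos_set / neg_set hold (Q, C) candidates whose downstream logic expects
    positive / negative input polarity. An inverter sends the new candidate
    to the opposite set; a non-inverting buffer keeps it in the same set.
    """
    R, K, Cin, inverting = buf
    new_pos, new_neg = list(pos_set), list(neg_set)
    for src, dst in ((pos_set, new_neg if inverting else new_pos),
                     (neg_set, new_pos if inverting else new_neg)):
        for q, c in src:
            dst.append((q - (R * c + K), Cin))   # Eq. (9) with this buffer
    return new_pos, new_neg

# An inverter applied to a positive-polarity candidate lands in the
# negative-polarity set, while the original candidate is retained.
pos, neg = insert_buffer_with_polarity([(100, 5)], [], (2.0, 1.0, 3.0, True))
```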
4.4 Slew and Capacitance Constraints

The slew rate of a signal refers to the rising or falling time of a signal switching. Sometimes the slew rate is referred to as the signal transition time. The slew rate of almost every signal has to be sufficiently small, since a large slew rate implies large delay, large short circuit power dissipation and large vulnerability to crosstalk noise. In practice, a maximal slew rate constraint is required at the input of each gate/buffer. Therefore, this constraint needs to be obeyed in a buffering algorithm [12-15].

A simple slew model is essentially equivalent to the Elmore model for delay. It can be explained using a generic example, a path p from node v_i (upstream) to v_j (downstream) in a buffered tree. There is a buffer (or the driver) b_u at v_i, and there is no buffer between v_i and v_j. The slew rate S(v_j) at v_j depends on both the output slew S_{bu,out}(v_i) at buffer b_u and the slew degradation S_w(p) along path p (or wire slew), and is given by [16]:

S(v_j) = √( S_{bu,out}(v_i)² + S_w(p)² ).    (11)

The slew degradation S_w(p) can be computed with Bakoglu's metric [17] as

S_w(p) = ln 9 · D(p),    (12)

where D(p) is the Elmore delay from v_i to v_j.
The output slew of a buffer, such as b_u at v_i, depends on the input slew at this buffer and the load capacitance seen from the output of the buffer. Usually, the dependence is described as a 2-D lookup table. As a simplified alternative, one can assume a fixed input slew at each gate/buffer. This fixed slew is equal to the maximum slew constraint and therefore is always satisfied, but it is a conservative estimation. For fixed input slew, the output slew of buffer b at vertex v is then given by

S_{b,out}(v) = R_b · C(v) + K_b,    (13)

where C(v) is the downstream capacitance at v, and R_b and K_b are empirical fitting parameters. This is similar to empirically derived K-factor equations [18]. We call R_b the slew resistance and K_b the intrinsic slew of buffer b.
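Equations (11)-(13) combine into a small slew evaluator. The parameter values below are made up for illustration; in practice R_b and K_b come from library characterization.

```python
import math

LN9 = math.log(9)   # Bakoglu's slew-degradation coefficient of Eq. (12)

def buffer_output_slew(Rb, Kb, load_cap):
    """Eq. (13): fixed-input-slew output slew of a buffer; Rb (slew
    resistance) and Kb (intrinsic slew) are empirical fitting parameters."""
    return Rb * load_cap + Kb

def sink_slew(s_out, elmore_delay):
    """Eqs. (11)-(12): combine the buffer output slew with the wire slew
    degradation Sw(p) = ln(9) * D(p) in root-sum-square fashion."""
    return math.hypot(s_out, LN9 * elmore_delay)

# Illustrative numbers (say, slews in ps, capacitance in pF):
s = sink_slew(buffer_output_slew(Rb=80.0, Kb=10.0, load_cap=0.5),
              elmore_delay=30.0)
# a candidate whose s exceeds the maximum slew constraint would be pruned
```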
In a van Ginneken style buffering algorithm, if a candidate solution has a slew rate greater than the given slew constraint, it is pruned and will not be propagated any further. Similar to the slew constraint, circuit designs also limit the maximum capacitive load a gate/buffer can drive [15]. For timing non-critical nets, buffer insertion is still necessary for the sake of satisfying the slew and capacitance constraints. For this case, fast slew buffering techniques are introduced in [19].
4.5 Integration with Wire Sizing

In addition to buffer insertion, wire sizing is an effective technique for improving interconnect performance [20-24]. If wire size can take only discrete options, which is often the case in practice, wire sizing can be directly integrated with a van Ginneken style buffer insertion algorithm [12]. In the bottom-up dynamic programming procedure, multiple wire width options need to be considered when a wire is added (see Section 3.2.1). If there are k options of wire size, then k new candidate solutions are generated, one corresponding to each wire size. However, including wire sizing in van Ginneken's algorithm makes the complexity pseudo-polynomial [12].

In [25], layer assignment and wire spacing are considered in conjunction with wire sizing. A combination of layer, width and spacing is called a wire code. All wires in a net have to use an identical wire code. If each wire code is treated as a polarity, the wire code assignment can be integrated with buffer insertion in the same way as handling a polarity constraint (see Section 4.3). In contrast to simultaneous wire sizing and buffer insertion [12], the algorithm complexity stays polynomial after integrating wire code assignment [25] with van Ginneken's algorithm.

Another important conclusion in [25] is about wire tapering. Wire tapering means that a wire segment is divided into multiple pieces and each piece can be sized individually. In contrast, uniform wire sizing does not make such a division and maintains the same wire width for the entire segment. These two cases are illustrated in Figure 6.

Figure 6: Wire sizing with tapering and uniform wire sizing.

It is shown in [25] that the benefit of wire tapering versus uniform wire sizing is very limited when combined with buffer insertion. It is theoretically proved [25] that the signal velocity from simultaneous buffering with wire tapering is at most 1.0354 times that from buffering with uniform wire sizing. In short, wire tapering improves signal speed by at most 3.54% over uniform wire sizing.
4.6 Noise Constraints with Devgan Metric

The shrinking of the minimum distance between adjacent wires has caused an increase in the coupling capacitance of a net to its neighbors. A large coupling capacitance can cause a switching net to induce significant noise onto a neighboring net, resulting in an incorrect functional response. Therefore, noise avoidance techniques must become an integral part of the performance optimization environment.

The amount of coupling capacitance from one net to another is proportional to the distance that the two nets run parallel to each other. The coupling capacitance may cause an input signal on the aggressor net to induce a noise pulse on the victim net. If the resulting noise is greater than the tolerable noise margin (NM) of the sink, then an electrical fault results. Inserting buffers in the victim net can separate the capacitive coupling into several independent and smaller portions, resulting in a smaller noise pulse at the sink and at the inputs of the inserted buffers.

Before describing the noise-aware buffering algorithms, we first introduce the coupling noise metric in Section 4.6.1.
4.6.1 Devgan’s coupling noise metric
Among many coupling noise models, Devgan’s metric [26] is particularly amenable for noise
avoidance in buffer insertion, because its computational complexity, structure, and incremental
nature is the same as the famous Elmore delay metric. Further, like the Elmore delay model, the
noise metric is a provable upper bound on coupled noise. Other advantages of the noise metric
include the ability to incorporate multiple aggressor netsand handle general victim and aggressor
net topologies. A disadvantage of the Devgan metric is that it becomes more pessimistic as the ratio
of the aggressor net’s transition time (at the output of the driver) to its delay decreases. However,
cases in which this ratio becomes very small are rare since a long net delay generally corresponds
to a large load on the driver, which in turn causes a slower transition time. The metric does not
consider the duration of the noise pulse either. In general,the noise margin of a gate is dependent
on both the peak noise amplitude and the noise pulse width. However, when considering failure at
a gate, peak amplitude dominates pulse width.
If a wire segmente in the victim net is adjacent witht aggressor nets, letλ1, ..., λt be the ratios
of coupling to wire capacitance from each aggressor net toe, and letµ1, ..., µt be the slopes of the
aggressor signals. The impact of a coupling from aggressorj can be treated as a current source
Ie,j = Ce ·λj ·µj whereCe is the wire capacitance of wire segmente. This is illustrated in Figure 7.
19
Figure 7: Illustration of the noise model.

The total current induced by the aggressors on e is

I_e = C_e Σ_{j=1}^{t} (λ_j · µ_j)    (14)
Often, information about neighboring aggressor nets is unavailable, especially if buffer insertion is performed before routing. In this case, a designer may wish to perform buffer insertion to improve performance while also avoiding future potential noise problems. When performing buffer insertion in estimation mode, one might assume that: (1) there is a single aggressor net which couples with each wire in the routing tree, (2) the slope of all aggressors is µ, and (3) some fixed ratio λ of the total capacitance of each wire is due to coupling capacitance.

Let I_T(v) be defined as the total downstream current seen at node v, i.e.,

I_T(v) = Σ_{e∈E_{T(v)}} I_e,

where E_{T(v)} is the set of wire edges downstream of node v. Each wire adds to the noise induced on the victim net. The amount of additional noise induced by a wire e = (u, v) is given by

Noise(e) = R_e ( I_e/2 + I_T(v) )    (15)
where R_e is the wire resistance. The total noise seen at sink s_i starting at some upstream node v is

Noise(v − s_i) = R_v I_T(v) + Σ_{e∈path(v−s_i)} Noise(e)    (16)

where R_v = 0 if there is no gate at node v. The path from v to s_i has no intermediate buffers. Each node v has a predetermined noise margin NM(v). If the circuit is to have no electrical faults, the total noise propagated from each driver/buffer to each of its sinks s_i must be less than the noise margin for s_i. We define the noise slack for every node v as

NS(v) = min_{s_i∈SI_{T(v)}} { NM(s_i) − Noise(v − s_i) }    (17)

where SI_{T(v)} is the set of sink nodes of the subtree rooted at node v. Observe that NS(s_i) = NM(s_i) for each sink s_i.
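For a single unbuffered path, Eqs. (14)-(17) can be evaluated directly. The sketch below assumes the estimation mode described above (a single aggressor with fixed λ and µ); all numeric values are illustrative.

```python
def line_noise_slack(segments, Rv, lam, mu, NM):
    """Devgan-metric noise slack from an upstream driver/buffer to the sink
    of a single unbuffered path (Eqs. 14-17).

    segments: (Re, Ce) pairs ordered from the upstream node to the sink;
    lam, mu: assumed coupling ratio and aggressor slope (estimation mode);
    Rv: resistance of the gate at the upstream node (0 if none).
    """
    currents = [Ce * lam * mu for _, Ce in segments]   # Ie per wire, Eq. (14)
    noise = Rv * sum(currents)                         # Rv * IT(v) term of Eq. (16)
    for i, (Re, _) in enumerate(segments):
        it_below = sum(currents[i + 1:])               # IT at the far end of edge i
        noise += Re * (currents[i] / 2 + it_below)     # Noise(e), Eq. (15)
    return NM - noise                                  # noise slack, Eq. (17)
```

With a single segment of resistance 2 and capacitance 3, λ = 0.1, µ = 4, an upstream gate resistance of 5, and NM = 100, the injected current is 1.2, the noise is 2 · 0.6 + 5 · 1.2 = 7.2, and the slack is 92.8.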
4.6.2 Algorithm of buffer insertion with noise avoidance

We begin with the simplest case: a single wire with uniform width and neighboring coupling capacitance. Let us consider a wire e = (u, v). First, we need to ensure NS(v) ≥ R_b I_T(v), where R_b is the buffer output resistance. If this condition is not satisfied, inserting a buffer even at node v cannot satisfy the noise margin constraint, i.e., buffer insertion is needed within the subtree T(v). If NS(v) ≥ R_b I_T(v), we next search for the maximum wirelength l_{e,max} of e such that inserting a buffer at u always satisfies the noise constraints. The value of l_{e,max} tells us the maximum unbuffered length, or the minimum buffer usage, for satisfying the noise constraints. Let R = R_e/l_e be the wire resistance per unit length and I = I_e/l_e be the current per unit length. According to [27], this value can be determined by

l_{e,max} = −R_b/R − I_T(v)/I + √( (R_b/R)² + (I_T(v)/I)² + 2NS(v)/(I·R) )    (18)
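Equation (18) is the positive root of the constraint that the noise added by the wire plus the buffer at u exactly uses up the available noise slack, which also gives a direct sanity check. The parameter values below are illustrative.

```python
import math

def max_unbuffered_length(Rb, R, I, IT_v, NS_v):
    """Maximum wirelength of e = (u, v) such that a buffer at u still meets
    the noise margin (Eq. 18). R and I are per-unit-length wire resistance
    and injected coupling current; IT_v and NS_v are taken at node v."""
    a, b = Rb / R, IT_v / I
    return -a - b + math.sqrt(a * a + b * b + 2 * NS_v / (I * R))

# Sanity check: at l = le_max, the noise added by the buffer at u and the
# wire of length le_max exactly consumes the slack NS(v).
Rb, R, I, IT_v, NS_v = 100.0, 0.05, 2e-3, 0.4, 120.0
le = max_unbuffered_length(Rb, R, I, IT_v, NS_v)
added = Rb * (IT_v + I * le) + R * le * (I * le / 2 + IT_v)
assert abs(added - NS_v) < 1e-6
```

The `added` expression is R_b·I_T(u) plus the wire's own noise contribution from Eq. (15), with I_T(u) = I_T(v) + I·l_e; setting it equal to NS(v) and solving the quadratic in l_e reproduces Eq. (18).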
Depending on the timing criticality of the net, the noise-aware buffer insertion problem can be
formulated in two different ways: (A) minimize total buffercost subject to noise constraints; (B)
maximize timing slack subject to noise constraints.
The algorithm for (A) is a bottom-up dynamic programming procedure which inserts buffers
greedily as far apart as possible [27]. Each partial solution at nodev is characterized by a 3-tuple
of downstream noise currentIT (v), noise slackNS(v) and buffer assignmentM . In the solution
propagation, the noise current is accumulated in the same way as the downstream capacitance in
van Ginneken’s algorithm. Likewise, noise slack is treatedlike the timing slack (or required arrival
time). This algorithm can return an optimal solution for a multi-sink treeT = (V,E) in O(|V |2)
time.
The core algorithm of noise-constrained timing slack maximization is similar to van Ginneken's algorithm except that the noise constraint is also considered. Each candidate solution at node v is represented by a 5-tuple of downstream capacitance C_v, required arrival time q(v), downstream noise current I_T(v), noise slack NS(v), and buffer assignment M. In addition to pruning inferior solutions according to the (C, q) pair, the algorithm eliminates candidate solutions that violate the noise constraint. At the source, the buffering solution not only has optimized timing performance but also satisfies the noise constraint.
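The combined pruning can be sketched as follows. This is an illustrative simplification (the buffer assignment M and the tree bookkeeping are omitted), not the exact procedure of [27]:

```python
from collections import namedtuple

# One candidate of formulation (B): downstream capacitance C, required
# arrival time q, downstream noise current I, and noise slack NS.
Candidate = namedtuple("Candidate", "C q I NS")

def prune(cands, Rb_min=0.0):
    """Drop candidates that violate the noise constraint (a buffer of
    output resistance Rb_min could not restore the margin) or that are
    dominated in the (C, q) sense."""
    feasible = [c for c in cands if c.NS >= Rb_min * c.I]   # noise check
    feasible.sort(key=lambda c: (c.C, -c.q))
    kept, best_q = [], float("-inf")
    for c in feasible:
        if c.q > best_q:      # no smaller-C candidate has q >= c.q
            kept.append(c)
            best_q = c.q
    return kept
```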
4.7 Higher Order Delay Modeling
Many buffer insertion methods [11, 12, 28] are based on the Elmore wire delay model [29] and
a linear gate delay model for the sake of simplicity. However, the Elmore delay model often
overestimates interconnect delay. It is observed in [30] that Elmore delay sometimes has over
100% overestimation error when compared to SPICE. A criticalreason of the overestimation is
due to the neglection of the resistive shielding effect. In the example of Figure 8, the Elmore delay
from nodeA to B is equal toR1(C1 + C2) assuming thatR1 can see the entire capacitance ofC2
despite the fact thatC2 is somewhat shielded byR2. Consider an extreme scenario whereR2 = ∞
or there is open circuit between nodeB andC. Obviously, the delay fromA to B should beR1C1
instead of the Elmore delayR1(C1 + C2). The linear gate delay model is inaccurate due to its
neglection of nonlinear behavior of gate delay in addition to resistive shielding effect. In other
words, a gate delay is not a strictly linear function of load capacitance.
Figure 8: Example of resistive shielding effect.
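The overestimation in the Figure 8 example is easy to reproduce numerically (illustrative component values in normalized units):

```python
def elmore_delay_AB(R1, C1, C2):
    """Elmore delay from A to B in Figure 8: R1 is charged with ALL
    downstream capacitance, ignoring that C2 is shielded by R2."""
    return R1 * (C1 + C2)

# Extreme case R2 -> infinity (open circuit between B and C): the true
# A-to-B delay is only R1*C1, but the Elmore estimate is unchanged.
R1, C1, C2 = 100.0, 1.0, 9.0            # illustrative normalized units
elmore = elmore_delay_AB(R1, C1, C2)    # 1000.0
true_delay = R1 * C1                    # 100.0, a 10x overestimate
```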
The simple and relatively inaccurate delay models are suitable only for early design stages such as buffer planning. In post-placement stages, more accurate models are needed because (1) optimal buffering solutions based on simple models may be inferior, since the actual delay is not what is being optimized; and (2) simplified delay modeling can cause a poor evaluation of the trade-off between total buffer cost and timing improvement. In more accurate delay models, the resistive shielding effect is considered by replacing the lumped load capacitance with a higher order load admittance estimation. The accuracy of wire delay estimation can be improved by including higher order moments of the transfer function. An accurate and popular gate delay model is a lookup table employed together with an effective capacitance [31, 32], which is obtained from the higher order load admittance. These techniques are described in more detail below.
4.7.1 Higher order point admittance model
For an RC tree, which is a typical circuit topology in buffer insertion, the frequency domain point admittance at a node v is denoted Y_v(s). It can be approximated by the third order Taylor expansion

$$Y_v(s) = y_{v,0} + y_{v,1}s + y_{v,2}s^2 + y_{v,3}s^3 + O(s^4)$$

where y_{v,0}, y_{v,1}, y_{v,2} and y_{v,3} are the expansion coefficients. The third order approximation usually provides satisfactory accuracy in practice. Its computation is a bottom-up procedure starting from the leaf nodes of the RC tree, i.e., the grounded capacitors. For a capacitance C connected to ground, the admittance at its upstream end is simply Cs. Note that the zeroth order coefficient is always 0 in an RC tree, since there is no DC path to ground. Therefore, we only need to propagate y_1, y_2 and y_3 in the bottom-up computation. There are two cases to consider:
• Case 1: For a resistance R, given the admittance Y_d(s) of its downstream node, compute the admittance Y_u(s) of its upstream node (Figure 9(a)):

$$y_{u,1} = y_{d,1}, \qquad y_{u,2} = y_{d,2} - R\,y_{d,1}^2, \qquad y_{u,3} = y_{d,3} - 2R\,y_{d,1}y_{d,2} + R^2 y_{d,1}^3 \qquad (19)$$
• Case 2: Given the admittances Y_{d1}(s) and Y_{d2}(s) of two branches, compute the admittance Y_u(s) after merging them (Figure 9(b)):

$$y_{u,1} = y_{d1,1} + y_{d2,1}, \qquad y_{u,2} = y_{d1,2} + y_{d2,2}, \qquad y_{u,3} = y_{d1,3} + y_{d2,3} \qquad (20)$$
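Equations (19) and (20) translate directly into code. In the sketch below (our own helper names), an admittance is represented by its coefficient triple (y1, y2, y3):

```python
def through_resistor(R, yd):
    """Case 1, Eq. (19): propagate (y1, y2, y3) upstream through resistance R."""
    y1, y2, y3 = yd
    return (y1,
            y2 - R * y1**2,
            y3 - 2.0 * R * y1 * y2 + R**2 * y1**3)

def merge(ya, yb):
    """Case 2, Eq. (20): merge two branch admittances at a node."""
    return tuple(a + b for a, b in zip(ya, yb))

def grounded_cap(C):
    """A capacitor C to ground contributes Y(s) = Cs, i.e. (C, 0, 0)."""
    return (C, 0.0, 0.0)
```

As a sanity check, a capacitor C behind a resistor R has the exact expansion Cs − RC²s² + R²C³s³ − ..., which Eq. (19) reproduces.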
Figure 9: Two scenarios of admittance propagation.
The third order approximation (y_1, y_2, y_3) of an admittance can be realized as an RC π-model (C_u, R_π, C_d) (Figure 10), where

$$C_u = y_1 - \frac{y_2^2}{y_3}, \qquad R_\pi = -\frac{y_3^2}{y_2^3}, \qquad C_d = \frac{y_2^2}{y_3} \qquad (21)$$
Figure 10: Illustration of the π-model.
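Equation (21) is a one-liner. The test builds (y1, y2, y3) from an exact π-model (y1 = Cu + Cd, y2 = −RπCd², y3 = Rπ²Cd³) and checks that the realization round-trips:

```python
def pi_model(y1, y2, y3):
    """Realize the third-order admittance (y1, y2, y3) as the RC pi-model
    (Cu, Rpi, Cd) of Eq. (21)."""
    Cd = y2**2 / y3
    return (y1 - Cd, -y3**2 / y2**3, Cd)
```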
4.7.2 Higher order wire delay model
While the Elmore delay is equal to the first order moment of the transfer function, the accuracy of delay estimation can be remarkably improved by including higher order moments; examples include the wire delay model [33] based on the first three moments and the closed-form model [34] using the first two moments.

Since van Ginneken style buffering algorithms proceed in a bottom-up manner, bottom-up moment computations are required. Figure 11(a) shows a wire e connected to a subtree rooted at node B. Assume that the first k moments m_{BC}^{(1)}, m_{BC}^{(2)}, ..., m_{BC}^{(k)} have already been computed for the path from B to C. We wish to compute the moments m_{AC}^{(1)}, m_{AC}^{(2)}, ..., m_{AC}^{(k)} so that the A to C delay can be derived.
Figure 11: Illustration of bottom-up moment computation.
The techniques in Section 4.7.1 are used to reduce the subtree at B to a π-model (C_j, R_π, C_f) (Figure 11(b)). Node D merely denotes the point on the far side of the resistor connected to B and is not an actual physical location. The RC tree can be further simplified to the network shown in Figure 11(c), where the capacitances C_j and C_e/2 at B are merged into a single capacitor of value C_n. The moments from A to B can be recursively computed by

$$m_{AB}^{(i)} = -R_e\left(m_{AB}^{(i-1)} C_n + m_{AD}^{(i-1)} C_f\right) \qquad (22)$$

where the moments from A to D are given by

$$m_{AD}^{(i)} = m_{AB}^{(i)} - R_\pi C_f\, m_{AD}^{(i-1)} \qquad (23)$$
and m_{AB}^{(0)} = m_{AD}^{(0)} = 1. Now the moments from A to C can be computed via moment multiplication as follows:

$$m_{AC}^{(i)} = \sum_{j=0}^{i} m_{AB}^{(j)} \cdot m_{BC}^{(i-j)} \qquad (24)$$
One property of the Elmore delay that makes it attractive for timing optimization is that delays are additive. This property does not hold for higher order delay models. Consequently, a non-critical sink in a subtree may become critical depending on the value of the upstream resistance [35]. Therefore, one must store the moments for all the paths to downstream sinks during the bottom-up candidate solution propagation.
4.7.3 Accurate gate delay
A popular gate delay model with decent accuracy consists of the following three steps:

1. Compute a π-model of the driving point admittance of the RC interconnect using the techniques introduced in Section 4.7.1.

2. Given the π-model and the characteristics of the driver, compute an effective capacitance C_eff [31, 32].

3. Based on C_eff, compute the gate delay using k-factor equations or a lookup table [36].
4.8 Flip-flop Insertion
Technology scaling leads to decreasing clock periods, increasing wire delays and growing chip sizes. Consequently, it often takes multiple clock cycles for signals to reach their destinations along global wires. Traditional interconnect optimization techniques such as buffer insertion are inadequate in this scenario, and flip-flop/latch insertion (or interconnect pipelining) becomes a necessity.
In pipelined interconnect design, flip-flops and buffers are inserted simultaneously in a given Steiner tree T = (V, E) [37, 38]. The simultaneous insertion algorithm is similar to van Ginneken's dynamic programming method except that a new criterion, latency, needs to be considered. The latency from the signal source to a sink is the number of flip-flops in between. Therefore, a candidate solution at node v ∈ V is characterized by its latency λ_v in addition to the downstream capacitance C_v and required arrival time (RAT) q_v. Obviously, a small latency is preferred.

The inclusion of flip-flops and latency also requires other changes in a van Ginneken style algorithm. When a flip-flop is inserted during the bottom-up candidate propagation, the RAT at the input of this flip-flop is reset to the clock period T_φ, and the latency of the corresponding candidate solution is increased by 1. For ease of presentation, clock skew and setup/hold time are neglected without loss of generality. Then the delay between two adjacent flip-flops cannot be greater than the clock period T_φ, i.e., the RAT cannot be negative. During candidate solution propagation, if a candidate solution has a negative RAT, it is pruned without further propagation. When merging two candidate solutions from two child branches, the latency of the merged solution is the maximum of the latencies of the two branch solutions.
There are two formulations for the simultaneous flip-flop and buffer insertion problem. MiLa: find the minimum latency that can be achieved. GiLa: find a flip-flop/buffer insertion implementation that satisfies a given latency constraint. MiLa can be used to estimate interconnect latency at the micro-architectural level. After the micro-architecture design is completed, all interconnects must be designed to abide by the given latency requirements using GiLa.

The MiLa and GiLa algorithms [38] are shown in Figure 12 and Figure 13, respectively.
In GiLa, the λ_u for a leaf node u is the latency constraint at that node. Usually, λ_u at a leaf is a non-positive number. For example, λ_u = −3 requires that the latency from the source to node u be 3. During the bottom-up solution propagation, λ is increased by 1 whenever a flip-flop is inserted. Therefore, λ = 0 at the source implies that the latency constraint is satisfied. If the latency at the source is greater than zero, the corresponding solution is not feasible (line 2.6.1 of Figure 13). If the latency at the source is less than zero, the latency constraint can be satisfied by padding extra flip-flops into the corresponding solution (line 2.6.2.1 of Figure 13). The padding procedure, called ReFlop(T_u, k), inserts k flip-flops in the root path of T_u. The root path runs from u to either a leaf node or a branch node v, with no other branch node in between. The flip-flops previously inserted on the root path and the newly inserted k flip-flops are redistributed evenly along the path. When merging solutions from two branches in GiLa, ReFlop is performed (lines 3.3-3.4.1 of Figure 13) on the solutions with smaller latency to ensure that there is at least one merged solution matching the latency of both branches.
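The candidate operations described above can be sketched as follows. This is a minimal illustration with candidates as (C, q, λ) tuples, Elmore wire delay, an illustrative clock period, and buffer insertion omitted; it is not the code of [37, 38]:

```python
T_PHI = 10.0   # clock period, illustrative units

def add_flipflop(cand, C_ff):
    """Insert a flip-flop with input capacitance C_ff: the RAT resets to
    the clock period T_PHI and the latency grows by one (skew and
    setup/hold neglected, as in the text)."""
    C, q, lam = cand
    return (C_ff, T_PHI, lam + 1)

def add_wire(cand, Rw, Cw):
    """Propagate a candidate through a wire (Elmore delay); prune (return
    None) once the RAT goes negative, i.e. the stage delay exceeds T_PHI."""
    C, q, lam = cand
    q2 = q - Rw * (Cw / 2.0 + C)
    return None if q2 < 0 else (C + Cw, q2, lam)

def merge(c1, c2):
    """Merge branch candidates: capacitances add, RATs take the minimum,
    and the merged latency is the larger of the two."""
    return (c1[0] + c2[0], min(c1[1], c2[1]), max(c1[2], c2[2]))
```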
5 Speedup Techniques
Due to the dramatically increasing number of buffers inserted in circuits, algorithms that can efficiently insert buffers are essential for design automation tools. In this section, several recently proposed speedup results are introduced and the key techniques are described.
5.1 Recent Speedup Results
This section studies buffer insertion in interconnect with a set of possible buffer positions and a discrete buffer library. In 1990, van Ginneken [11] proposed an O(n²) time dynamic programming algorithm for buffer insertion with one buffer type, where n is the number of possible buffer positions. His algorithm finds a buffer insertion solution that maximizes the slack at the source. In 1996, Lillis, Cheng and Lin [12] extended van Ginneken's algorithm to allow b buffer types in time O(b²n²).

Recently, many efforts have been made to speed up van Ginneken's algorithm and its extensions. Shi and Li [39] improved the time complexity of van Ginneken's algorithm to O(b²n log n) for 2-pin nets, and O(b²n log² n) for multi-pin nets. The speedup is achieved by four novel techniques: predictive pruning, candidate tree, fast redundancy check, and fast merging. To reduce the quadratic effect of b, Li and Shi [40] proposed an algorithm with time complexity O(bn²). The speedup is achieved by convex pruning and the observation that the best candidate to be associated with any buffer must lie on the convex hull of the (Q, C) plane. To exploit the fact that in real applications most nets have a small number of pins and a large number of buffer positions, Li and Shi [41] proposed a simple O(mn) algorithm for m-pin nets. The speedup is achieved by the property explored in [40], convex pruning, a clever bookkeeping method and an innovative linked list that allows O(1) time update for adding a wire or a candidate.

In the following subsections, new pruning techniques, an efficient way to find the best candidates when adding a buffer, and implicit data representations are presented. They are the basic components of many recent speedup algorithms.

Algorithm: MiLa(Tu) / MiLa(Tu,v)
Input: Subtree rooted at node u, or edge (u, v)
Output: A set of candidate solutions Γu
Global: Routing tree T and buffer library B
1. if u is a leaf, Γu = {(Cu, qu, 0, 0)}  // q is the required arrival time
2. else if u has one child node v, or the input is Tu,v
   2.1 Γv = MiLa(Tv)
   2.2 Γu = ∪_{γ ∈ Γv} addWire((u, v), γ)
   2.3 Γb = ∅
   2.4 for each b in B
       2.4.1 Γ = ∪_{γ ∈ Γu} addBuffer(γ, b)
       2.4.2 prune Γ
       2.4.3 Γb = Γb ∪ Γ
   2.5 Γu = Γu ∪ Γb
3. else if u has two child edges (u, v) and (u, z)
   3.1 Γu,v = MiLa(Tu,v), Γu,z = MiLa(Tu,z)
   3.2 Γu = Γu ∪ merge(Γu,v, Γu,z)
4. prune Γu
5. return Γu

Figure 12: The MiLa algorithm.

Algorithm: GiLa(Tu) / GiLa(Tu,v)
Input: Subtree Tu rooted at node u, or edge (u, v)
Output: A set of candidate solutions Γu
Global: Routing tree T and buffer library B
1. if u is a leaf, Γu = {(Cu, qu, λu, 0)}
2. else if node u has one child node v, or the input is Tu,v
   2.1 Γv = GiLa(Tv)
   2.2 Γu = ∪_{γ ∈ Γv} addWire((u, v), γ)
   2.3 Γb = ∅
   2.4 for each b in B
       2.4.1 Γ = ∪_{γ ∈ Γu} addBuffer(γ, b)
       2.4.2 prune Γ
       2.4.3 Γb = Γb ∪ Γ
   2.5 Γu = Γu ∪ Γb
       // Γu ≡ {Γx, ..., Γy}, where x, y indicate latency
   2.6 if u is the source
       2.6.1 if x > 0, exit: the net is not feasible
       2.6.2 if y < 0  // insert −y more flip-flops in Γu
           2.6.2.1 Γu = ReFlop(Tu, −y)
3. else if u has two child edges (u, v) and (u, z)
   3.1 Γu,v = GiLa(Tu,v), Γu,z = GiLa(Tu,z)
   3.2 // Γu,v ≡ {Γx, ..., Γy}, Γu,z ≡ {Γm, ..., Γn}
   3.3 if y < m  // insert m − y more flip-flops in Γu,v
       3.3.1 Γu,v = ReFlop(Tu,v, m − y)
   3.4 if n < x  // insert x − n more flip-flops in Γu,z
       3.4.1 Γu,z = ReFlop(Tu,z, x − n)
   3.5 Γu = Γu ∪ merge(Γu,v, Γu,z)
4. prune Γu
5. return Γu

Figure 13: The GiLa algorithm.
5.2 Predictive Pruning
In van Ginneken's algorithm, a candidate is pruned only if another candidate is superior in terms of both capacitance and slack. This pruning is based on the information at the node currently being processed. However, all candidates at this node must be propagated further upstream toward the source, which means the load seen at this node must be driven through some minimal amount of upstream wire or gate resistance. By anticipating this upstream resistance ahead of time, one can prune potentially inferior candidates earlier rather than later, which reduces the total number of candidates generated. More specifically, assume that each candidate must be driven by an upstream resistance of at least R_min. Pruning based on this anticipated upstream resistance is called predictive pruning.
Definition 1 (Predictive pruning) Let α1 and α2 be two nonredundant candidates of T(v) such that C(α1) < C(α2) and Q(α1) < Q(α2). If Q(α2) − R_min · C(α2) ≤ Q(α1) − R_min · C(α1), then α2 is pruned.
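Definition 1 can be applied in one pass over a nonredundant list. In the sketch below (our own helper name), candidates are (C, Q) pairs and R_min is assumed known:

```python
def predictive_prune(cands, R_min):
    """Predictive pruning per Definition 1: among candidates sorted by
    increasing C (and Q, for a nonredundant set), drop alpha2 whenever
    Q2 - R_min*C2 <= Q1 - R_min*C1 for some alpha1 with smaller C."""
    kept, best = [], float("-inf")
    for C, Q in sorted(cands):
        score = Q - R_min * C   # slack after the guaranteed upstream R_min
        if score > best:
            kept.append((C, Q))
            best = score
    return kept
```

With R_min = 0 this degenerates to the usual (C, Q) pruning and keeps every nonredundant candidate.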
Predictive pruning preserves optimality. The general situation is shown in Figure 14. Let α1 and α2 be candidates of T(v1) that satisfy the condition in Definition 1. Using α1 instead of α2 will not increase the delay from v to the sinks under v2, ..., vk, since it is easy to see that C(v, α1) < C(v, α2). If Q at v is determined by T(v1), we have

$$Q(v, \alpha_1) - Q(v, \alpha_2) = Q(v_1, \alpha_1) - Q(v_1, \alpha_2) - R_{min} \cdot (C(v_1, \alpha_1) - C(v_1, \alpha_2)) \geq 0$$

Therefore, α2 is redundant.
Figure 14: If α1 and α2 satisfy the condition in Definition 1 at v1, α2 is redundant.
The predictive pruning technique prunes more redundant solutions while guaranteeing optimality. It is one of the four key techniques of the fast algorithms proposed in [39]. In [42], significant speedup is achieved by simply extending predictive pruning to buffer cost. An aggressive predictive pruning technique, which uses a resistance larger than R_min to prune candidates, is proposed in [43] to achieve further speedup with a small degradation of solution quality.
5.3 Convex Pruning
The basic data structure of van Ginneken style algorithms is a sorted list of non-dominated candidates. Both the pruning in van Ginneken's algorithm and predictive pruning are performed by comparing two neighboring candidates at a time. However, more potentially inferior candidates can be pruned by comparing three neighboring candidate solutions simultaneously: for three consecutive solutions in the sorted list, the middle one may be pruned according to convex pruning.
Definition 2 (Convex pruning) Let α1, α2 and α3 be three nonredundant candidates of T(v) such that C(α1) < C(α2) < C(α3) and Q(α1) < Q(α2) < Q(α3). If

$$\frac{Q(\alpha_2) - Q(\alpha_1)}{C(\alpha_2) - C(\alpha_1)} < \frac{Q(\alpha_3) - Q(\alpha_2)}{C(\alpha_3) - C(\alpha_2)}, \qquad (25)$$

then we call α2 non-convex and prune it.
Convex pruning can be explained by Figure 15. Consider Q as the Y-axis and C as the X-axis; then candidates are points in the two-dimensional plane. It is easy to see that the set of nonredundant candidates N(v) forms a monotonically increasing sequence. Candidate α2 = (Q2, C2) in the above definition is shown in Figure 15(a) and pruned in Figure 15(b). The set of nonredundant candidates after convex pruning, M(v), lies on a convex hull.
Figure 15: (a) Nonredundant candidates N(v). (b) Nonredundant candidates M(v) after convex pruning.
For 2-pin nets, convex pruning preserves optimality. Let α1, α2 and α3 be candidates of T(v) that satisfy the condition in Definition 2. In Figure 15, let the slope between α1 and α2 (α2 and α3) be ρ1,2 (ρ2,3). If candidate α2 is not on the convex hull of the solution set, then ρ1,2 < ρ2,3. These candidates must see some upstream resistance R, including wire resistance and buffer/driver resistance. If R < ρ2,3, then α2 must become inferior to α3 when both candidates are propagated to the upstream node. Otherwise, R ≥ ρ2,3, which implies R > ρ1,2, and therefore α2 must become inferior to α1. In other words, if a candidate is not on the convex hull, it will be pruned either by the solution ahead of it or by the solution behind it. Note that this conclusion only applies to 2-pin nets. For multi-pin nets, where the upstream node could be a merging vertex, nonredundant candidates removed by convex pruning could still be useful.
Convex pruning of a list of non-redundant candidates sorted in increasing (Q, C) order can be performed in linear time by Graham's scan. Furthermore, when a new candidate is inserted into the list, we only need to check its neighbors to decide whether any candidate should be pruned under convex pruning. The time is O(1), amortized over all candidates.
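A Graham-scan pass implementing Definition 2 can be sketched as follows (candidates given as (C, Q) pairs in increasing order; the helper name is our own):

```python
def convex_prune(cands):
    """Graham-scan style convex pruning of nonredundant candidates sorted
    in increasing (C, Q) order; keeps only the convex hull of Definition 2."""
    hull = []
    for C, Q in cands:
        # pop the middle point while the slope condition of Eq. (25) holds
        while len(hull) >= 2:
            (C1, Q1), (C2, Q2) = hull[-2], hull[-1]
            if (Q2 - Q1) * (C - C2) < (Q - Q2) * (C2 - C1):
                hull.pop()      # hull[-1] is non-convex: prune it
            else:
                break
        hull.append((C, Q))
    return hull
```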
In [40, 41], convex pruning is used to form the convex hull of non-redundant candidates, which is the key component of the O(bn²) algorithm and the O(mn) algorithm. In [43], convex pruning (called squeeze pruning there) is performed on both 2-pin and multi-pin nets to prune more solutions with a small degradation of solution quality.
5.4 Efficient Way to Find Best Candidates
Assume v is a buffer position, and we have computed the set of nonredundant candidates N′(v) for T(v), where N′(v) does not include candidates with buffers inserted at v. Now we want to insert buffers at v and compute N(v). Define P_i(v, α) as the slack at v if we add a buffer of type B_i to a candidate α:

$$P_i(v, \alpha) = Q(v, \alpha) - R(B_i) \cdot C(v, \alpha) - K(B_i) \qquad (26)$$

If we do not insert any buffer, then every candidate in N′(v) is a candidate in N(v). If we insert a buffer, then for every buffer type B_i, i = 1, 2, ..., b, there will be a new candidate β_i:

$$Q(v, \beta_i) = \max_{\alpha \in N'(v)} P_i(v, \alpha), \qquad C(v, \beta_i) = C(B_i)$$
Define the best candidate for B_i as the candidate α ∈ N′(v) that maximizes P_i(v, α) among all candidates in N′(v). If there are multiple α's that maximize P_i(v, α), choose the one with minimum C. In van Ginneken's algorithm, it takes O(bn) time to find the best candidates at each buffer position.
According to convex pruning, it is easy to see that all best candidates lie on the convex hull. The following lemma says that if we sort candidates in increasing Q and C order from left to right, then as we add wires to the candidates, the best candidates can only move to the left.
Lemma 1 For any T(v), let the nonredundant candidates after convex pruning be α1, α2, ..., αk, in increasing Q and C order. Now add wire e to each candidate αj, denoted αj + e. For any buffer type Bi, if αj gives the maximum Pi(αj) and αl gives the maximum Pi(αl + e), then l ≤ j.
The following lemma says that if all candidates are convex, the best candidate can be found by a local search.

Lemma 2 For any T(v), let the nonredundant candidates after convex pruning be α1, α2, ..., αk, in increasing Q and C order. If Pi(αj−1) ≤ Pi(αj) and Pi(αj) ≥ Pi(αj+1), then αj is the best candidate for buffer type Bi, and

$$P_i(\alpha_1) \leq \cdots \leq P_i(\alpha_{j-1}) \leq P_i(\alpha_j), \qquad P_i(\alpha_j) \geq P_i(\alpha_{j+1}) \geq \cdots \geq P_i(\alpha_k)$$
With the above two lemmas and convex pruning, best candidates are found in amortized O(n) time in [40] and O(b) time in [41]¹, which is more efficient than in van Ginneken's algorithm.

¹In [40], Lemma 1 is presented differently: it says that if all buffers are sorted in decreasing order of driving resistance, then the best candidates for the buffer types, taken in that order, appear from left to right.
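Lemma 2 suggests a simple local search. A sketch with hypothetical buffer parameters Rb (driving resistance) and Kb (intrinsic delay); candidates are (C, Q) pairs on a convex-pruned list:

```python
def best_candidate(hull, Rb, Kb):
    """Find the candidate maximizing P = Q - Rb*C - Kb on a convex-pruned
    list in increasing (Q, C) order. By Lemma 2, P rises and then falls
    along the list, so a linear walk stopping at the first local maximum
    finds the global one; the strict '>' keeps the minimum-C tie-breaker."""
    P = lambda c: c[1] - Rb * c[0] - Kb
    j = 0
    while j + 1 < len(hull) and P(hull[j + 1]) > P(hull[j]):
        j += 1
    return hull[j]
```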
5.5 Implicit Representation
Van Ginnken’s algorithm uses explicit representation to store slack and capacitance values, and
therefore it takesO(bn) time when adding a wire. It is possible to use implicit representation to
avoid explicit updating of candidates.
In the implicit representation,C(v, α) andQ(v, α) are not explicitly stored for each candidate.
Instead, each candidate contains 5 fields:q, c, qa, ca andra 2. Whenqa, ca andra are all 0,q
andc giveQ(v, α) andC(v, α), respectively. When a wire is added, onlyqa, ca andra in the root
of the tree ( [39]) or as global variables themselves ( [41]) are updated. Intuitively,qa represents
extra wire delay,ca represents extra wire capacitance andra represents extra wire resistance.
With the implicit representation, it takes only O(1) time to add a wire [39, 41]. For example, in [41], when we reach an edge e with resistance R(e) and capacitance C(e), the fields qa, ra and ca are updated to reflect the new values of Q and C of all previous candidates in O(1) time, without actually touching any candidate:

$$q_a \leftarrow q_a + R(e) \cdot C(e)/2 + R(e) \cdot c_a, \qquad c_a \leftarrow c_a + C(e), \qquad r_a \leftarrow r_a + R(e)$$

The actual values of Q and C of each candidate α are decided as follows:

$$Q(\alpha) = q - q_a - r_a \cdot c, \qquad C(\alpha) = c + c_a \qquad (27)$$

²In [41], only two fields, q and c, are necessary for each candidate; qa, ca and ra are global variables for each 2-pin segment.
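The update and decode rules can be sketched as a small class. This follows the global-variable flavor of [41] but is our own simplification, not its exact data structure:

```python
class CandidateList:
    """Implicit (q, c) storage: adding a wire updates only the three
    fields qa, ca, ra in O(1), and Eq. (27) decodes actual (Q, C)."""
    def __init__(self):
        self.cands = []                  # stored (q, c) pairs
        self.qa = self.ca = self.ra = 0.0

    def add_candidate(self, Q, C):
        # store values that decode back to (Q, C) under Eq. (27)
        c = C - self.ca
        self.cands.append((Q + self.qa + self.ra * c, c))

    def add_wire(self, Re, Ce):          # O(1), touches no candidate
        self.qa += Re * Ce / 2.0 + Re * self.ca
        self.ca += Ce
        self.ra += Re

    def actual(self):                    # decode per Eq. (27)
        return [(q - self.qa - self.ra * c, c + self.ca)
                for (q, c) in self.cands]
```

Adding a wire (Re, Ce) upstream of a candidate (Q, C) yields Q − Re(Ce/2 + C) and C + Ce after decoding, matching the explicit Elmore update.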
The implicit representation is applied to a balanced tree in [39], where the operation of adding a wire takes O(b log n) time, and to a sorted linked list in [41], where adding a wire takes O(1) time.
References
[1] J. Cong. An interconnect-centric design flow for nanometer technologies. Proceedings of the IEEE, 89(4):505–528, April 2001.

[2] J. A. Davis, R. Venkatesan, A. Kaloyeros, M. Beylansky, S. J. Souri, K. Banerjee, K. C. Saraswat, A. Rahman, R. Reif, and J. D. Meindl. Interconnect limits on gigascale integration (GSI) in the 21st century. Proceedings of the IEEE, 89(3):305–324, March 2001.

[3] R. Ho, K. W. Mai, and M. A. Horowitz. The future of wires. Proceedings of the IEEE, 89(4):490–504, April 2001.

[4] A. B. Kahng and G. Robins. On optimal interconnections for VLSI. Kluwer Academic Publishers, Boston, MA, 1995.

[5] J. Cong, L. He, C.-K. Koh, and P. H. Madden. Performance optimization of VLSI interconnect layout. Integration: the VLSI Journal, 21:1–94, 1996.
[6] P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick. Repeater scaling and its impact on CAD. IEEE Transactions on Computer-Aided Design, 23(4):451–463, April 2004.

[7] J. Cong. Challenges and opportunities for design innovations in nanometer technologies. SRC Design Sciences Concept Paper, 1997.

[8] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear programming: theory and algorithms. John Wiley and Sons, 1993.

[9] C. J. Alpert and A. Devgan. Wire segmenting for improved buffer insertion. In Proceedings of the ACM/IEEE Design Automation Conference, pages 588–593, 1997.

[10] C. C. N. Chu and D. F. Wong. Closed form solution to simultaneous buffer insertion/sizing and wire sizing. ACM Transactions on Design Automation of Electronic Systems, 6(3):343–371, July 2001.

[11] L. P. P. P. van Ginneken. Buffer placement in distributed RC-tree networks for minimal Elmore delay. In Proceedings of the IEEE International Symposium on Circuits and Systems, pages 865–868, 1990.

[12] J. Lillis, C. K. Cheng, and T. Y. Lin. Optimal wire sizing and buffer insertion for low power and a generalized delay model. IEEE Journal of Solid-State Circuits, 31(3):437–447, March 1996.

[13] N. Menezes and C.-P. Chen. Spec-based repeater insertion and wire sizing for on-chip interconnect. In Proceedings of the International Conference on VLSI Design, pages 476–483, 1999.
[14] L.-D. Huang, M. Lai, D. F. Wong, and Y. Gao. Maze routing with buffer insertion under transition time constraints. IEEE Transactions on Computer-Aided Design, 22(1):91–95, January 2003.

[15] C. J. Alpert, A. B. Kahng, B. Liu, I. I. Mandoiu, and A. Z. Zelikovsky. Minimum buffered routing with bounded capacitive load for slew rate and reliability control. IEEE Transactions on Computer-Aided Design, 22(3):241–253, March 2003.

[16] C. Kashyap, C. J. Alpert, F. Liu, and A. Devgan. Closed form expressions for extending step delay and slew metrics to ramp inputs. In Proceedings of the ACM International Symposium on Physical Design, pages 24–31, 2003.

[17] H. B. Bakoglu. Circuits, interconnections and packaging for VLSI. Addison-Wesley, Reading, MA, 1990.

[18] N. H. E. Weste and K. Eshraghian. Principles of CMOS VLSI design: a system perspective. Addison-Wesley Publishing Company, Reading, MA, 1993.

[19] S. Hu, C. J. Alpert, J. Hu, S. Karandikar, Z. Li, W. Shi, and C.-N. Sze. Fast algorithms for slew constrained minimum cost buffering. In Proceedings of the ACM/IEEE Design Automation Conference, pages 308–313, 2006.

[20] J. Cong and C. K. Koh. Simultaneous driver and wire sizing for performance and power optimization. IEEE Transactions on VLSI Systems, 2(4):408–425, December 1994.

[21] S. S. Sapatnekar. RC interconnect optimization under the Elmore delay model. In Proceedings of the ACM/IEEE Design Automation Conference, pages 392–396, 1994.
[22] J. Cong and K.-S. Leung. Optimal wiresizing under the distributed Elmore delay model. IEEE Transactions on Computer-Aided Design, 14(3):321–336, March 1995.

[23] J. P. Fishburn and C. A. Schevon. Shaping a distributed RC line to minimize Elmore delay. IEEE Transactions on Circuits and Systems, 42(12):1020–1022, December 1995.

[24] C. P. Chen, Y. P. Chen, and D. F. Wong. Optimal wire-sizing formula under the Elmore delay model. In Proceedings of the ACM/IEEE Design Automation Conference, pages 487–490, 1996.

[25] C. J. Alpert, A. Devgan, J. P. Fishburn, and S. T. Quay. Interconnect synthesis without wire tapering. IEEE Transactions on Computer-Aided Design, 20(1):90–104, January 2001.

[26] A. Devgan. Efficient coupled noise estimation for on-chip interconnects. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 147–151, 1997.

[27] C. J. Alpert, A. Devgan, and S. T. Quay. Buffer insertion for noise and delay optimization. IEEE Transactions on Computer-Aided Design, 18(11):1633–1645, November 1999.

[28] C. C. N. Chu and D. F. Wong. A new approach to simultaneous buffer insertion and wire sizing. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 614–621, 1997.

[29] W. C. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, 19:55–63, January 1948.
[30] F. J. Liu, J. Lillis, and C. K. Cheng. Design and implementation of a global router based on a new layout-driven timing model with three poles. In Proceedings of the IEEE International Symposium on Circuits and Systems, pages 1548–1551, 1997.

[31] J. Qian, S. Pullela, and L. T. Pillage. Modeling the effective capacitance for the RC interconnect of CMOS gates. IEEE Transactions on Computer-Aided Design, 13(12):1526–1535, December 1994.

[32] S. R. Nassif and Z. Li. A more effective Ceff. In Proceedings of the IEEE International Symposium on Quality Electronic Design, pages 648–653, 2005.

[33] B. Tutuianu, F. Dartu, and L. Pileggi. Explicit RC-circuit delay approximation based on the first three moments of the impulse response. In Proceedings of the ACM/IEEE Design Automation Conference, pages 611–616, 1996.

[34] C. J. Alpert, F. Liu, C. V. Kashyap, and A. Devgan. Closed-form delay and slew metrics made easy. IEEE Transactions on Computer-Aided Design, 23(12):1661–1669, December 2004.

[35] C. J. Alpert, A. Devgan, and S. T. Quay. Buffer insertion with accurate gate and interconnect delay computation. In Proceedings of the ACM/IEEE Design Automation Conference, pages 479–484, 1999.

[36] C.-K. Cheng, J. Lillis, S. Lin, and N. Chang. Interconnect analysis and synthesis. Wiley Interscience, New York, NY, 2000.

[37] S. Hassoun, C. J. Alpert, and M. Thiagarajan. Optimal buffered routing path constructions for single and multiple clock domain systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 247–253, 2002.
[38] P. Cocchini. A methodology for optimal repeater insertion in pipelined interconnects. IEEE Transactions on Computer-Aided Design, 22(12):1613–1624, December 2003.

[39] W. Shi and Z. Li. A fast algorithm for optimal buffer insertion. IEEE Transactions on Computer-Aided Design, 24(6):879–891, June 2005.

[40] Z. Li and W. Shi. An O(bn²) time algorithm for buffer insertion with b buffer types. IEEE Transactions on Computer-Aided Design, 25(3):484–489, March 2006.

[41] Z. Li and W. Shi. An O(mn) time algorithm for optimal buffer insertion of nets with m sinks. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 320–325, 2006.

[42] W. Shi, Z. Li, and C. J. Alpert. Complexity analysis and speedup techniques for optimal buffer insertion with minimum cost. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 609–614, 2004.

[43] Z. Li, C. N. Sze, C. J. Alpert, J. Hu, and W. Shi. Making fast buffer insertion even faster via approximation techniques. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 13–18, 2005.