
Simultaneous Placement with Clustering and Duplication

GANG CHEN

Magma Design Automation

and

JASON CONG

UCLA

Clustering, duplication, and placement are critical steps in a cluster-based FPGA design flow. Clustering has a great impact on the wirelength, timing, and routability of a circuit. Logic duplication is an effective method for improving performance while maintaining the logic equivalence of a circuit. Based on several novel algorithmic contributions, we present an efficient and effective algorithm named SPCD (simultaneous placement with clustering and duplication) which performs clustering and duplication during placement for wirelength and timing minimization. First, we incorporate a path counting-based net weighting scheme for more effective timing optimization. Secondly, we introduce a novel method of moving a fragment of a cluster (called a fragment level move) during placement to optimize the clustering structure. To reduce the critical path detour during legalization from a more global perspective, we also introduce the notions of a monotone region and a global monotone region in which improvement to the local/global path detour is guaranteed. Furthermore, we introduce a notion of a constrained gain graph to embed all complex FPGA clustering constraints, and implement an optimal incremental legalization algorithm under such constraints. Finally, in order to reduce the circuit area, we formulate a timing-constrained global redundancy removal problem and propose a heuristic solution. Our SPCD algorithm outperforms a widely used academic FPGA placement flow, T-VPack + VPR, with an average reduction of 31% in the longest path estimate delay and 18% in the routed delay. We also apply our SPCD algorithm to Altera’s Stratix architecture in a commercial FPGA implementation flow (Quartus II 4.0). The routed result achieved by our SPCD algorithm outperforms VPR by 20% and outperforms Quartus II 4.0 by 4%.

Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids—Placement and routing; G.4 [Mathematics of Computing]: Mathematical Software—Algorithm design and analysis; J.6 [Computer Applications]: Computer-Aided Engineering—Computer-aided design (CAD)

This research was partially funded by NSF Grant CCF-0096383 and by a grant from Magma Design Automation under the California MICRO Program. This research was performed as part of G. Chen’s Ph.D. study at UCLA. Portions of this article were published in Chen and Cong [2004, 2005].

Authors’ addresses: G. Chen, Magma Design Automation, 5460 Bayfront Plaza, Santa Clara, CA 95054; email: [email protected]; J. Cong, Computer Science Department, University of California at Los Angeles, Campus Mailcode 159610, Los Angeles, CA 90095-1596; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].

© 2006 ACM 1084-4309/06/0700-0740 $5.00

ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 3, July 2006, Pages 740–772.


General Terms: Algorithms, Design, Performance

Additional Key Words and Phrases: Placement, clustering, duplication, legalization, redundancy removal, FPGA

1. INTRODUCTION

Field programmable gate arrays (FPGAs) have become increasingly popular in recent years because of their short time-to-market, field programmability, ease of use, and low cost in small- to medium-volume production. A typical type of FPGA is based on a K-input lookup-table (LUT), which can implement any K-input function. A typical FPGA architecture described by Betz and Rose [1997], the LUT-based FPGA family, contains two levels of physical hierarchy: basic logic elements (BLE) and cluster-based logic blocks (CLB). As described in Figure 1, each BLE contains a K-input LUT and a flip-flop (FF), and the LUT and FF share the same output. As described in Figure 2, each CLB contains N BLEs, I inputs, and N outputs. Each of the I inputs can drive all the BLEs, and each BLE drives an output. Here, K, N, and I are parameters described by an architecture file. The interconnect delay between BLEs within the same CLB is usually much smaller than the delay between BLEs in different CLBs. We call this a baseline architecture.

Many commercial LUT-based FPGA architectures are similar to the baseline architecture, and some of them have two or more levels of physical hierarchy. For instance, Altera’s Stratix architecture consists of logic elements (LEs) and logic array blocks (LABs). As shown in Figure 3, an LE is the smallest logic unit in the Stratix architecture, and it corresponds to a BLE in the baseline architecture. An LE contains a four-input LUT, a programmable register, and a carry chain. Unlike a BLE, the LUT and FF in an LE have separate outputs. A LAB is equivalent to a CLB in the baseline architecture, and Figure 4 shows its structure in the Stratix device. Each LAB consists of ten LEs and forms the second level of the physical hierarchy. For more details, please refer to the Stratix Device Handbook [2006].

A typical FPGA physical implementation flow consists of the following steps: clustering/packing, logic duplication, placement, and routing. The clustering step packs LUTs and FFs into CLBs according to the connectivity and timing of a mapped netlist; the duplication step clones one or more logic cells on critical paths to improve speed; the placement step assigns locations to all the nodes of a clustered netlist; and the routing step connects all the nets of a placed netlist.

For both Figures 5 and 6, we define a target architecture: Each CLB contains two BLEs, the interconnect delay is proportional to the Manhattan distance, and both logic and intracluster delays are 0. As illustrated in Figure 5, clustering has a great impact on placement. The initial network in Figure 5(a) consists of four FFs and two LUT+FFs. As shown in Figure 1, each LUT+FF contains a LUT directly driving an FF. An optimal clustering solution in Figure 5(b), with a minimum area of three clusters and a minimum logic level of one, can be obtained from T-VPack [Marquardt et al. 1997]. However, if the target device contains only two rows and two columns of CLBs, the optimal placement solution in Figure 5(c) on this optimal clustering yields a longest path delay of 2.0. Instead, if we perform clustering together with placement, we can obtain the clustering and placement solution in Figure 5(d) with a longest path delay of 1.0.

Fig. 1. Basic logic element (BLE).

Fig. 2. Cluster-based logic block (CLB).

Fig. 3. Stratix logic element [Stratix Device Handbook 2006].

Fig. 4. Stratix LAB structure [Stratix Device Handbook 2006].

Fig. 5. Impact of clustering on placement.

Fig. 6. Impact of duplication on placement.

As illustrated in Figure 6, duplication has a great impact on placement as well. The initial network in Figure 6(a) consists of five FFs and one LUT. We use the same architecture as in Figure 5, and assume that the target device contains one row and three columns of CLBs. An optimal clustering solution in Figure 6(b), with a minimum area of three clusters and a minimum logic level of two, can be obtained from T-VPack. However, the optimal placement solution in Figure 6(c) on this optimal clustering yields a longest path delay of 2.0, which cannot be improved by postplacement duplication. Instead, if we perform duplication together with placement, we can obtain the solution in Figure 6(d) with a longest path delay of 1.0.

2. REVIEW OF EXISTING WORK

Packing LUTs and FFs into CLBs is a critical step in a cluster-based FPGA design flow, since it has a great impact on wirelength, timing, and routability. VPack [Betz and Rose 1997] packs each logic block to its capacity to minimize the number of clusters and encourages input sharing to minimize the number of connections between clusters. The timing-driven version, T-VPack [Marquardt et al. 1999], minimizes the number of connections on the critical path, since internal connections are normally much faster than external ones. Rpack [Bozorgzadeh et al. 2001] introduces an effective routability metric and presents a routability-driven clustering algorithm for cluster-based FPGAs. PRIME [Cong et al. 1999] integrates retiming with performance-driven clustering/partitioning. Given an area bound for each cluster, PRIME generates a quasioptimal solution if duplication is allowed.

Logic duplication is a common technique for improving circuit performance by cloning one or more logic cells while maintaining the logic equivalence of a circuit. In the past, logic duplication for timing optimization has been studied in the following contexts. First, logic duplication is applied before placement in the logic synthesis domain. The speed of a circuit can be improved by replicating high fanout logic gates on the critical path to isolate critical sinks from noncritical ones [Lillis et al. 1996; Srivastava et al. 2000]. Lillis et al. [1996] performed gate replication to reduce the delay and area of a circuit under certain timing requirements. In Srivastava et al. [2000], the authors present an effective heuristic algorithm for the gate replication problem under a load-dependent delay model. They show that both global and local (fanout partitioning) logic duplication problems for delay optimization are NP-complete. The gate replication technique complements the popular gate sizing approach in the ASIC flow.

Secondly, logic duplication is applied after placement as a postprocessing step to further increase design performance for FPGAs. Beraudo and Lillis [2003] observe that coordinates of cells on a critical path often do not follow a monotone order, and they propose a heuristic replication algorithm to straighten them (reduce detour) locally. A legalization engine based on the “ripple-move” approach in Mongrel [Hur and Lillis 2000] is used to legalize the placement incrementally. The average delay reduction over VPR obtained by Beraudo and Lillis [2003] is 7.5%. The follow-up work [Hrkic et al. 2004] improves upon Beraudo and Lillis [2003] by incorporating two new techniques: a timing-driven fanin-tree embedding and a replication tree. The authors first introduce an optimal algorithm to solve the fanin-tree embedding problem under a general cost function; they then propose a replication tree to introduce large subcircuits to be solved by the embedding algorithm. The average delay reduction over VPR obtained by Hrkic et al. [2004] is 14.2%.

However, limited work has been done to carry out clustering and/or logic duplication during placement. Neumann et al. [1999] apply logic duplication in a recursive partitioning-based timing-driven placement flow. During each recursion, they sequentially perform timing analysis, net length estimation and weight calculation, bipartitioning, and cell replication. Before cells are assigned to rows, the redundancies introduced by gate replication are removed. This combined approach outperforms gate sizing by 10%, on average.

In this article we propose an efficient and effective algorithm to perform simultaneous clustering and logic duplication during placement for both wirelength and timing minimization. First, we incorporate a path counting-based net weighting scheme [Kong 2002] into our approach. Secondly, we introduce a novel fragment level move during placement to optimize the clustering structure. Thirdly, we introduce the notions of a monotone region and a global monotone region, which enable the optimization of nonmonotone paths from a global perspective. We then present an optimal incremental legalization algorithm under complex clustering constraints using a constrained gain graph. Finally, we formulate a timing-constrained global redundancy removal problem and propose a heuristic solution by solving the local redundancy removal problems optimally. The resulting algorithm, named SPCD, outperforms a widely used academic FPGA placement flow, T-VPack + VPR, with an average reduction of 31% in the longest path estimate delay and 18% in the routed delay. Meanwhile, our combined approach has the same runtime complexity as the existing VPR placement algorithm, and both runtime and area increases are very small.

3. INITIAL ANALYSIS

We first define the default baseline FPGA architecture that we use for many of the experiments in this article. In this default architecture, each BLE consists of one four-input LUT and one FF, each CLB consists of four BLEs, all wires span only one logic block, and all routing switches are tristate buffers. Later on in Section 5, we shall extend and apply our study to an FPGA architecture with more complex logic and routing structures, namely, Altera’s Stratix architecture. During the study of VPR’s placement results on this default architecture, we confirmed two observations mentioned in Beraudo and Lillis [2003].

First, the number of critical/near-critical pins is relatively small. Assuming the longest path delay is T, a pin t is critical if slack(t) = 0; t is x% critical if slack(t)/T ≤ x%. From Table I, we can see that, on average, the percentage of critical pins is 0.10%, the percentage of 5% critical pins is 1.1%, the percentage of 10% critical pins is 3.7%, the percentage of 15% critical pins is 8.5%, and the percentage of 20% critical pins is 15.7%. It seems possible to perform a very small number of postplacement duplications to speed up the circuit by 5–10%. However, achieving more than a 10–15% speedup may involve many nodes.
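The x% criticality test can be written down directly. The following is a minimal sketch, assuming pin slacks are available as a plain list of numbers; the function name and argument names are hypothetical and are not taken from VPR or from our implementation.

```python
def percent_near_critical(slacks, longest_delay, x_percent):
    """Return the percentage of pins t with slack(t) / T <= x%.

    slacks        : slack value of every pin (hypothetical input format)
    longest_delay : T, the longest path delay (must be positive)
    x_percent     : criticality threshold x, e.g. 0, 5, 10, 15, 20
    """
    if not slacks or longest_delay <= 0:
        raise ValueError("need at least one pin and a positive longest delay")
    threshold = x_percent / 100.0
    near_critical = sum(1 for s in slacks if s / longest_delay <= threshold)
    return 100.0 * near_critical / len(slacks)


# Four pins on a design whose longest path delay is 10.0.
example_slacks = [0.0, 0.4, 1.2, 3.0]
critical = percent_near_critical(example_slacks, 10.0, 0)   # only slack 0.0 counts
five_pct = percent_near_critical(example_slacks, 10.0, 5)   # slacks 0.0 and 0.4 count
```

With x = 0 the test reduces to slack(t) = 0, so the first column of Table I is the special case of fully critical pins.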


Table I. Percentage of Near-Critical Pins

Circuit     0%       5%       10%      15%      20%
ex5p        0.13%    1.99%    6.54%    15.88%   30.68%
apex4       0.15%    3.04%    10.14%   22.66%   39.62%
Misex3      0.11%    1.40%    4.47%    11.83%   22.74%
Tseng       0.28%    1.71%    4.17%    7.09%    9.45%
alu4        0.12%    1.50%    5.53%    14.38%   26.53%
dsip        0.05%    0.17%    0.98%    2.29%    4.83%
seq         0.09%    0.73%    3.64%    9.69%    19.57%
diffeq      0.17%    1.02%    3.00%    6.35%    10.96%
apex2       0.11%    0.84%    5.93%    15.29%   27.36%
s298        0.23%    3.60%    10.55%   19.43%   30.03%
des         0.05%    0.24%    0.77%    1.93%    5.85%
bigkey      0.04%    0.15%    0.27%    0.38%    1.17%
spla        0.06%    1.17%    4.08%    10.85%   20.19%
elliptic    0.07%    1.03%    3.81%    8.01%    12.75%
ex1010      0.04%    0.78%    2.54%    7.23%    17.70%
pdc         0.04%    0.46%    2.16%    5.98%    12.83%
frisc       0.10%    0.76%    2.21%    4.67%    8.24%
s38584.1    0.04%    0.36%    1.01%    2.05%    3.24%
s38417      0.03%    0.44%    1.00%    1.99%    4.94%
clma        0.02%    0.15%    0.51%    1.88%    4.96%
Average     0.10%    1.08%    3.67%    8.49%    15.68%

Secondly, the critical paths are highly nonmonotone. For a path $p$ consisting of $m$ nodes (BLEs, in our case) $v_1, v_2, \ldots, v_m$, $v_1$ is the start point and $v_m$ is the end point. The x coordinate of node $v_i$ is $x(v_i)$ and the y coordinate of node $v_i$ is $y(v_i)$. The Manhattan distance between any two nodes $v_i$ and $v_j$ is defined as

$$\mathrm{dist}(v_i, v_j) = |x(v_i) - x(v_j)| + |y(v_i) - y(v_j)|.$$

A subpath $v_{i-1}, v_i, v_{i+1}$ is monotone if both the x and y coordinates of $v_{i-1}, v_i, v_{i+1}$ follow a monotone order; the local monotone region of $v_i$ is the placement region in which $v_i$ can be placed so that the subpath is monotone. The deviation of $v_i$ with respect to one of its input nodes $v_{i-1}$ and one of its output nodes $v_{i+1}$, namely, the measurement of how much a node is outside of its monotone region, is defined as

$$\mathrm{dev}(v_{i-1}, v_i, v_{i+1}) = \mathrm{dist}(v_{i-1}, v_i) + \mathrm{dist}(v_i, v_{i+1}) - \mathrm{dist}(v_{i-1}, v_{i+1}).$$

The subpath $v_{i-1}, v_i, v_{i+1}$ is monotone if $\mathrm{dev}(v_{i-1}, v_i, v_{i+1}) = 0$; that is, it is a shortest Manhattan distance subpath. The Manhattan distance of path $p$ is defined as

$$\mathrm{dist}(p) = \sum_{i=1}^{m-1} \mathrm{dist}(v_i, v_{i+1}).$$

The minimum distance of path $p$ is defined as $\mathrm{min\_dist}(p) = \mathrm{dist}(v_1, v_m)$; that is, the Manhattan distance between the start and end points. The path $p$ is globally monotone if $\mathrm{dist}(p) = \mathrm{min\_dist}(p)$. The level of a path $p$ is defined as $\mathrm{level}(p) = m$. The $\mathrm{unit\_dist}$ is defined as the distance between two adjacent CLBs. The detour ratio is defined as

$$\mathrm{dr}(p) = \frac{\mathrm{dist}(p)}{\max(\mathrm{min\_dist}(p),\ \mathrm{level}(p) \cdot \mathrm{unit\_dist})}.$$

Here $\mathrm{dist}(p)$ is the actual Manhattan distance of $p$; assuming that $v_1$ and $v_m$ are fixed, $\mathrm{min\_dist}(p)$ is the ideal Manhattan distance of $p$ when it is globally monotone; $\mathrm{level}(p) \cdot \mathrm{unit\_dist}$ is the distance of $p$ when it is placed almost ideally; the detour ratio describes how nonmonotone the path $p$ actually is. The reason that we use the maximum of $\mathrm{min\_dist}(p)$ and $\mathrm{level}(p) \cdot \mathrm{unit\_dist}$ to compute the detour ratio is that sometimes nodes $v_1$ and $v_m$ can be placed very close to each other, or even inside the same CLB. For the measurement of the average/minimum/maximum detour ratios in Table II, we consider all the 5% critical paths from PI/FFs to PO/FFs. Both the average detour ratio of 4.49 and the minimum detour ratio of 2.53 in Table II show that the near-critical paths are far from monotone.

Table II. Detour Ratio of 5% Near-Critical Paths

Circuit     avg dr(p)   min dr(p)   max dr(p)
ex5p        3.31        2.00        10.14
apex4       3.07        2.00        5.43
Misex3      3.52        1.71        10.83
Tseng       2.14        1.50        2.90
alu4        3.50        2.03        10.29
dsip        1.00        1.00        1.00
seq         5.75        2.12        12.00
diffeq      3.79        3.50        4.09
apex2       4.30        1.68        9.86
s298        6.64        4.73        7.92
des         4.36        1.48        19.00
bigkey      1.01        1.00        1.04
spla        5.34        1.86        14.86
elliptic    4.22        2.10        10.25
ex1010      6.08        2.64        21.00
pdc         3.53        1.85        15.50
frisc       4.10        2.93        5.50
s38584.1    3.43        1.25        8.25
s38417      4.89        2.33        10.00
clma        15.92       10.79       18.36
Average     4.49        2.53        9.91
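The distance, deviation, and detour-ratio definitions of this section translate into a few lines of code. The sketch below assumes that each node is given as an (x, y) tuple of CLB coordinates and that unit_dist is 1 unless stated otherwise; the function names are illustrative only.

```python
def dist(a, b):
    """Manhattan distance between two placed nodes given as (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dev(prev_node, node, next_node):
    """Deviation of `node` with respect to one input and one output node."""
    return dist(prev_node, node) + dist(node, next_node) - dist(prev_node, next_node)


def path_dist(path):
    """Manhattan distance of a path: sum over consecutive node pairs."""
    return sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))


def detour_ratio(path, unit_dist=1):
    """dr(p) = dist(p) / max(min_dist(p), level(p) * unit_dist)."""
    min_dist = dist(path[0], path[-1])
    level = len(path)
    return path_dist(path) / max(min_dist, level * unit_dist)


# A path that doubles back along the x axis: 3 + 2 + 3 = 8 units of wire
# for end points that are only 4 units apart.
zigzag = [(0, 0), (3, 0), (1, 0), (4, 0)]
zigzag_dr = detour_ratio(zigzag)             # 8 / max(4, 4) = 2.0
zigzag_dev = dev(zigzag[0], zigzag[1], zigzag[2])  # 3 + 2 - 1 = 4

# A globally monotone path has zero deviation at every interior node.
straight = [(0, 0), (1, 1), (2, 3)]
straight_dev = dev(*straight)                # 2 + 3 - 5 = 0
```

A detour ratio of 1.0 means that the path is as short as its end points and its level allow; the averages in Table II are well above that.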

4. SIMULTANEOUS PLACEMENT WITH CLUSTERING AND DUPLICATION

4.1 Algorithm Overview

Our algorithm uses a simulated annealing-based optimization engine [Betz and Rose 1997; Marquardt et al. 2000] to minimize a weighted function of wirelength and timing (weighted edge delays). In Figure 7, we show the overall flow of our SPCD algorithm. First, we perform an initial clustering on a mapped netlist, then we generate a random placement. During the annealing process, we optimize the clustering structure and cluster locations at the same time. To improve the suboptimal clustering structure during placement, we introduce a novel fragment level move to relocate a mapped node (BLE) into/out of a cluster (CLB). After each move, we update the cost function and decide whether the move should be kept. We iteratively perform a number of moves at each temperature and then reduce the temperature until the acceptance ratio is too low. At the end of each temperature, we perform redundancy removal, logic duplication, and legalization, sequentially. To reduce the detour in critical paths from a global perspective, we introduce the notions of a monotone region and a global monotone region. To handle the complex constraints in commercial FPGA architectures, we introduce a constrained gain graph and perform optimal incremental legalization. To control the runtime of the duplication procedure, we limit the number of duplications allowed for each temperature. Through experiments, we find that good results can be achieved in a short runtime when this limit is logarithmic in the circuit size. In order to merge duplicated copies to reduce area, we introduce a duplication graph representation and propose a heuristic to solve a global redundancy removal problem.

Fig. 7. SPCD algorithm overview.
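The control structure described above can be sketched as a generic annealing loop with a hook that runs at the end of each temperature. This is only an illustrative skeleton: the geometric cooling schedule, the parameter names and defaults, and the `end_of_temperature` callback (standing in for redundancy removal, duplication, and legalization) are assumptions made for the sketch, not the schedule used by VPR or SPCD.

```python
import math
import random


def anneal(state, cost, propose, end_of_temperature,
           t_start=10.0, cooling=0.9, moves_per_temp=50,
           min_accept_ratio=0.05, max_rounds=1000, seed=0):
    """Generic annealing loop with a per-temperature hook.

    cost(state)               -> number to minimize
    propose(state, rng)       -> a new candidate state (one move)
    end_of_temperature(state) -> state after the end-of-temperature steps
    Returns the final state and the number of temperatures visited.
    """
    rng = random.Random(seed)
    current_cost = cost(state)
    temperature = t_start
    rounds = 0
    while rounds < max_rounds:
        accepted = 0
        for _ in range(moves_per_temp):
            candidate = propose(state, rng)
            delta = cost(candidate) - current_cost
            # Metropolis criterion: always keep improvements, keep a
            # worsening move with probability exp(-delta / temperature).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                state = candidate
                current_cost += delta
                accepted += 1
        # Hook for the steps performed at the end of each temperature.
        state = end_of_temperature(state)
        current_cost = cost(state)
        rounds += 1
        # Stop once too few moves are being accepted.
        if accepted / moves_per_temp < min_accept_ratio:
            break
        temperature *= cooling
    return state, rounds
```

In this skeleton a block level move and a fragment level move would simply be two kinds of proposals produced by `propose`; the `max_rounds` guard is added only to guarantee termination of the sketch.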

In the following sections, we describe the key components of our SPCD algorithm: a path counting-based net weighting scheme, clustering optimization during placement, logic duplication, optimal incremental legalization under complex constraints, a duplication graph, and redundancy removal.

4.2 Path Counting-Based Net Weighting

Net-based timing-driven placers (e.g., [Marquardt et al. 2000]) convert timing information into net weight and optimize a weighted function of all nets. The basic idea of net weighting is to assign higher weights to timing-critical nets and lower weights to noncritical nets. The net weighting scheme is both efficient and flexible enough to handle complex constraints, but most existing methods do not take path information into account.

In this article we implement a novel net weighting scheme [Kong 2002] which accurately counts all paths (critical and noncritical) for certain types of discount functions. One such discount function is $D(x, y) = a^{-x/y}$, where $a$ is a positive constant number, $x$ is the slack of a path, and $y$ is the delay of the longest path. This scheme considers path sharing, and assigns a higher weight to edges shared by two or more critical paths. For more details about path counting, please refer to Kong [2002].
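To make the effect of path sharing concrete, the sketch below computes edge weights by brute-force path enumeration on a tiny timing graph: every source-to-sink path contributes D(slack, T) to each edge it uses. This is only an illustration of the quantity being accumulated, under the assumption that an edge weight is the sum of the discounts of the paths through it; it is not the efficient counting algorithm of Kong [2002], and the graph encoding, the function names, and the choice a = 2 are hypothetical.

```python
def discount(slack, longest_delay, a=2.0):
    """Discount function D(x, y) = a ** (-x / y)."""
    return a ** (-slack / longest_delay)


def enumerate_paths(edge_delay, sources, sinks):
    """List every source-to-sink path of a small DAG as a list of nodes."""
    successors = {}
    for (u, v) in edge_delay:
        successors.setdefault(u, []).append(v)
    paths = []

    def walk(node, path):
        if node in sinks:
            paths.append(path)
            return
        for nxt in successors.get(node, []):
            walk(nxt, path + [nxt])

    for s in sources:
        walk(s, [s])
    return paths


def edge_weights(edge_delay, sources, sinks, a=2.0):
    """Sum the discount of every path over the edges that the path uses."""
    paths = enumerate_paths(edge_delay, sources, sinks)
    delays = [sum(edge_delay[e] for e in zip(p, p[1:])) for p in paths]
    longest = max(delays)
    weights = {e: 0.0 for e in edge_delay}
    for path, delay in zip(paths, delays):
        w = discount(longest - delay, longest, a)  # slack = T - path delay
        for e in zip(path, path[1:]):
            weights[e] += w
    return weights


# Two paths, a->c->d (delay 4, slack 0) and b->c->d (delay 3, slack 1),
# share the edge (c, d), which therefore receives the largest weight.
toy = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 2.0}
toy_weights = edge_weights(toy, sources={"a", "b"}, sinks={"d"})
```

Enumeration is exponential in general, which is why a counting scheme that avoids listing paths is needed in practice.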

4.3 Clustering Optimization During Placement

As an artificial step in an FPGA implementation flow, packing nodes into clusters has two benefits. First, it hides the complex packing constraints from the placement algorithms. Secondly, it reduces the size of the placement problem and reduces runtime. However, due to a lack of physical information, the clustering procedure can make wrong decisions in grouping logic. This becomes especially severe when the chip utilization is high, and the clustering procedure has to pack unrelated logic together. Furthermore, the placement procedure honors the clustering solution and does not correct any packing mistakes.

One of our key contributions in this article is to optimize the clustering structure during placement. Conventional FPGA placers only carry out the block level move, which moves a CLB node to a new location and swaps with another CLB node if necessary. We introduce a novel and effective fragment level move, which moves a BLE node to a new CLB and swaps with another BLE node if necessary. As a result, we are able to significantly improve the suboptimal clustering structure and achieve a high-quality placement. With the simultaneous clustering and placement optimization, we can correct mistakes made during the previous clustering stage and significantly improve both wirelength and timing.

After we perform a fragment level move, we must determine whether the new CLB is in a valid configuration. To honor the packing constraints of a CLB, we need to check the number of BLEs and inputs. For commercial FPGA architectures, we also need to verify the number of clocks, the number of control signals, the number of feedbacks, etc. Hence, we dynamically update a set of hash-maps for each involved CLB whenever a fragment level move is performed. The complexity of the update is $O(K)$, where $K$ is the input size of a LUT. By carefully controlling the number of fragment level moves, we shall show in Section 6 that the complexity of our placement algorithm is $O(n^{4/3})$.

Fig. 8. Overview of a duplication algorithm.
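The bookkeeping for the baseline architecture can be sketched with two hash-maps per CLB: one counting how many BLE input pins use each net, and one recording which nets are driven inside the CLB. The class below is a hypothetical illustration limited to the two baseline constraints (at most N BLEs and at most I inputs); it assumes that a net driven inside the CLB does not consume a CLB input, and it leaves out the clock, control-signal, and feedback limits of commercial devices.

```python
from collections import Counter


class CLBState:
    """Per-CLB counters updated on every fragment level move (illustrative)."""

    def __init__(self, max_bles, max_inputs):
        self.max_bles = max_bles      # N in the baseline architecture
        self.max_inputs = max_inputs  # I in the baseline architecture
        self.bles = {}                # BLE name -> (input nets, output net)
        self.net_uses = Counter()     # net -> number of BLE input pins using it
        self.driven = Counter()       # net -> number of BLEs here driving it

    def add(self, name, inputs, output):
        """Move a BLE into this CLB; touches at most K + 1 counters."""
        self.bles[name] = (tuple(inputs), output)
        for net in inputs:
            self.net_uses[net] += 1
        self.driven[output] += 1

    def remove(self, name):
        """Move a BLE out of this CLB; touches at most K + 1 counters."""
        inputs, output = self.bles.pop(name)
        for net in inputs:
            self.net_uses[net] -= 1
            if self.net_uses[net] == 0:
                del self.net_uses[net]
        self.driven[output] -= 1
        if self.driven[output] == 0:
            del self.driven[output]

    def external_inputs(self):
        """Distinct nets that must enter through CLB input pins."""
        return sum(1 for net in self.net_uses if net not in self.driven)

    def is_legal(self):
        return (len(self.bles) <= self.max_bles
                and self.external_inputs() <= self.max_inputs)


clb = CLBState(max_bles=2, max_inputs=3)
clb.add("b1", ["n1", "n2"], "o1")
clb.add("b2", ["o1", "n3"], "o2")   # o1 is fed back locally
```

In this sketch `add` and `remove` each cost O(K), while `external_inputs` scans the nets of the CLB; a running count could be maintained instead if the check itself has to be O(K).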

4.4 Logic Duplication

Our logic duplication algorithm can be performed either after a full placement or during a placement at the end of each iteration (e.g., in a simulated annealing-based placer or a quadratic programming-based placer). As shown in Figure 8, a timing analysis is first performed on the current placement. Then, we iteratively select a candidate node on the critical path, duplicate or move it to a new destination CLB, redistribute fanouts, and immediately legalize the destination CLB to resolve any possible physical constraint conflicts. In case the delay after legalization increases, both the legalization and the duplication operations will be undone. The iteration continues until there are no more candidates or until the limit on the number of duplications is reached.

4.4.1 Criticality-Driven Candidate Selection. Initially, we put all the near-critical PO pins into a heap sorted by the slack, and then we iteratively select the most critical pin t from the heap to perform the speedup operation. The candidate node, the source node s, is duplicated and moved to a new location so as to straighten all the near-critical paths flowing through edge (s, t). When two sink pins have the same criticality, we use the deviation of their source nodes to break ties. After a source node s is duplicated and legalized, we put the remaining timing-critical input pins of the clone into the heap.
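The selection order can be illustrated with a binary heap keyed on slack. In this minimal sketch, pins are hypothetical (name, slack, source deviation) tuples, and ties in slack are assumed to be broken in favor of the larger source deviation; the text above does not fix the direction of the tie-break, so that choice is an assumption.

```python
import heapq


def selection_order(pins):
    """Return pin names in the order they would be popped from the heap.

    pins: iterable of (pin_name, slack, source_deviation) tuples.
    Smaller slack is more critical; among equal slacks, the pin whose
    source node has the larger deviation is taken first (assumption).
    """
    heap = [(slack, -deviation, name) for name, slack, deviation in pins]
    heapq.heapify(heap)
    order = []
    while heap:
        _slack, _neg_deviation, name = heapq.heappop(heap)
        order.append(name)
        # In the full flow, the remaining timing-critical input pins of a
        # newly created clone would be pushed here with heapq.heappush.
    return order


candidates = [("t1", 0.2, 1), ("t2", 0.0, 2), ("t3", 0.0, 5)]
order = selection_order(candidates)   # t3 and t2 tie on slack; t3 deviates more
```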

4.4.2 Monotone Region and Global Monotone Region. After a candidate node $s$ is chosen, we find a destination location for it in order to minimize the critical path delay. We assume node $s$ has $k$ critical input nodes $i_1, i_2, \ldots, i_k$, as shown in Figure 9.

ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 3, July 2006.

Fig. 9. An example of a monotone region.

Fig. 10. An example of global monotone region.

First, we define a monotone region MR({i_j}, s, t) for node s with respect to one of its input nodes i_j and the output node t, which is the minimum bounding box enclosing i_j, s, and t. Node s can be placed anywhere inside the monotone region MR({i_j}, s, t) without increasing the deviation of s with respect to i_j and t; hence the subpath i_j → s → t can be shortened.

Next, we define a monotone region MR({i_1, i_2, ..., i_k}, s, t), which is the intersection of all MR({i_j}, s, t). Node s can be placed anywhere inside the monotone region MR({i_1, i_2, ..., i_k}, s, t) without increasing the deviation of s with respect to any of its input nodes and the output node t. If we search for the destination location within this monotone region, all the critical paths passing through edge (s, t) can be shortened.

Figure 9 is an example of the monotone region. The three rectangles with dashed lines are MR({i_1}, s, t), MR({i_2}, s, t), and MR({i_3}, s, t), respectively; the gray rectangle is MR({i_1, i_2, i_3}, s, t). For the largest circuit in our benchmark set, clma, the average size of the monotone region is around 5% of the total placement area.
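The two definitions above translate directly into bounding-box arithmetic. The following is a minimal sketch, assuming integer grid coordinates and regions represented as (x_lo, y_lo, x_hi, y_hi) tuples; the function names are illustrative only.

```python
def bounding_box(points):
    """Minimum bounding box (x_lo, y_lo, x_hi, y_hi) of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def monotone_region(inputs, s, t):
    """Intersection of MR({i_j}, s, t) over all critical inputs i_j.

    Every individual box contains both s and t, so the intersection is
    never empty: it always contains the bounding box of s and t.
    """
    boxes = [bounding_box([i, s, t]) for i in inputs]
    return (max(b[0] for b in boxes), max(b[1] for b in boxes),
            min(b[2] for b in boxes), min(b[3] for b in boxes))
```

Passing the coordinates of the critical primary inputs instead of the immediate inputs gives the global monotone region with the same routine.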

One drawback of the work of Beraudo and Lillis [2003] is that it cannot handle paths that are locally monotone but globally nonmonotone. As shown in Figure 10, both subpaths pi_1 → r → s and r → s → t are monotone (shortest paths). However, the complete path pi_1 → r → s → t is nonmonotone, since the y coordinates of the path from pi_1 to t do not follow a monotone order. When the net delay is linear in the Manhattan distance, such paths cannot be shortened by the approach in Beraudo and Lillis [2003], or by using the monotone region alone. To resolve such a global nonmonotone problem, when we attempt to move node s, we need to consider not only its direct fanin r, but also its primary input pi_1. Therefore, we define the notion of a global monotone region. A fanin cone C_v, rooted at v, is a connected subnetwork which consists of only v and its predecessors; a critical fanin cone C_v^crit, rooted at v, is a connected subnetwork which consists of only v and its timing-critical predecessors. For the primary input set of C_s^crit, we assume there are l critical primary inputs pi_1, pi_2, ..., pi_l. A global monotone region is defined as MR({pi_1, pi_2, ..., pi_l}, s, t). During the computation of the global monotone region, we use the x and y coordinates of primary inputs pi_1, pi_2, ..., pi_l instead of the immediate inputs i_1, i_2, ..., i_k. With the introduction of the global monotone region, we give priority to locations inside both the regular and global monotone regions. As illustrated in Figure 10, node s will first be moved to s′, node r will then be moved to r′, and the complete path will be shortened to the monotone path pi_1 → r′ → s′ → t.

4.4.3 Destination Selection Within the Monotone Region. We iterate through each location inside the monotone region and choose the destination location (x, y) such that Δcost(x, y) is minimal. The Δcost(x, y) function at a location (x, y) is defined as

Δcost(x, y) = −α * Δslack(t) + β * overflow_cost(x, y) + γ * g_path_cost(x, y).

The coefficients α, β, and γ are predefined constants. The first term, Δslack(t), describes the timing improvement: it is the increase in slack at sink pin t when the clone is placed at location (x, y). The second term, overflow_cost(x, y), depicts the legality of the placement or the difficulty of the legalization. It is 0 when the destination location can accommodate the clone; otherwise, it is the difference between the actual usage and the capacity. Priority is given to locations that can accommodate the copy of s without violating any physical constraints. The third term, g_path_cost(x, y), characterizes the violation of the global monotonicity. It is 0 when the destination location is inside the global monotone region; otherwise, it is the minimum distance from (x, y) to the global monotone region.
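As an illustration of the search just described, the sketch below scans every location of a rectangular monotone region and keeps the one with the smallest cost. Several details are assumptions made for the example: regions are (x_lo, y_lo, x_hi, y_hi) tuples, the distance to the global monotone region is measured in Manhattan distance, each clone occupies one slot, all CLBs share a single `capacity`, and `delta_slack` is a caller-supplied function.

```python
def dist_to_box(x, y, box):
    """Manhattan distance from (x, y) to a box; 0 if the point is inside."""
    x_lo, y_lo, x_hi, y_hi = box
    dx = max(x_lo - x, 0, x - x_hi)
    dy = max(y_lo - y, 0, y - y_hi)
    return dx + dy

def best_destination(region, delta_slack, usage, capacity, global_region,
                     alpha=1.0, beta=1.0, gamma=1.0):
    """Return (cost, (x, y)) minimizing the three-term cost over the region.

    usage: dict mapping (x, y) to the number of occupied slots (assumed model).
    """
    x_lo, y_lo, x_hi, y_hi = region
    best = None
    for x in range(x_lo, x_hi + 1):
        for y in range(y_lo, y_hi + 1):
            # Usage after adding the clone, minus capacity, floored at zero.
            overflow = max(usage.get((x, y), 0) + 1 - capacity, 0)
            cost = (-alpha * delta_slack(x, y)
                    + beta * overflow
                    + gamma * dist_to_box(x, y, global_region))
            if best is None or cost < best[0]:
                best = (cost, (x, y))
    return best
```

Raising β relative to α steers the search toward locations that need no legalization, at the price of a smaller slack gain.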

4.4.4 Timing-Driven Fanout Partitioning. After a clone node is placed at a destination location, we perform timing-driven fanout partitioning to redistribute the fanouts to their corresponding inputs. Each fanout node t is assigned to a copy of the source node such that the arrival time at t is minimal. This is similar to the approach described in Beraudo and Lillis [2003].
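The assignment rule can be written in a few lines. The sketch below is hypothetical in its data model: `arrival` maps each copy of the source node to its output arrival time, and `delay(c, t)` is any caller-supplied interconnect delay estimate.

```python
def partition_fanouts(copies, fanouts, arrival, delay):
    """Assign each fanout t to the copy c minimizing arrival[c] + delay(c, t)."""
    return {t: min(copies, key=lambda c: arrival[c] + delay(c, t))
            for t in fanouts}
```

For example, with a delay that is linear in the Manhattan distance, a fanout close to the clone is driven by the clone even if the clone's own arrival time is slightly worse than that of the original.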

4.5 Optimal Incremental Legalization under Complex Constraints

We perform legalization immediately after each duplication operation. If the delay on the most critical path deteriorates, we will undo both legalization and duplication.

First, we describe the ripple-move-based legalization approach used in Hur and Lillis [2000]. For each ripple-move, we select a source location S with overflow and a destination location T with extra capacity, and find a maximum gain monotone path from S to T along which a sequence of cells is moved. To determine the maximum gain path and the cells to be moved, a global analysis based on the gains of individual cells is performed. Given the source S and the destination T, each cell can only be moved in, at most, two directions. The gain value associated with each cell move is the reduction in the cost function, and the gain value associated with each location and direction is the maximum gain value among all the cells moving in that direction. Then, we can construct a gain graph in which each vertex corresponds to a location inside the rectangular region determined by S and T, and in which each weighted arc represents the maximum gain value in the direction of the arc. Figure 11 is an example of a gain graph. Since a gain graph is acyclic, the maximum gain path can be found by dynamic programming in a topological order. When a ripple-move is performed on this maximum gain path, a cell is allowed to move more than once so that the final gain is equal to or better than the value determined by the maximum gain path.

Fig. 11. An example of a gain graph.

When each cell is allowed to move by at most one unit of distance, the ripple-move algorithm is optimal for certain cost functions such as the bounding box wirelength [Hur and Lillis 2000], the weighted source-sink distance, etc. However, the optimality of the maximum gain path does not hold for a general cost function. For example, the timing cost is the summation of weighted delays over all edges, and the delay of an edge is determined by the locations of both source and sink pins. If there is an edge between two cells on the maximum monotone gain path, then the timing cost reduction precomputed for the sink node would be inaccurate, since the source node is moved as well. As a result, the maximum gain path computed by the ripple-move algorithm may not be optimal for timing optimization under a general delay model. However, if the interconnect delay is a linear function of the Manhattan distance, the maximum gain path is still optimal.
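The dynamic program over the gain graph can be sketched as follows, assuming T lies to the north-east of S so that every arc points east or north. The grid indexing and the dictionaries `gain_east` and `gain_north`, keyed by the tail location of each arc, are assumptions made for this example and not the data structures of Hur and Lillis [2000].

```python
def max_gain_path(width, height, gain_east, gain_north):
    """Maximum total gain of a monotone path from (0, 0) to (width-1, height-1).

    gain_east[(x, y)]  is the gain of the arc (x, y) -> (x + 1, y).
    gain_north[(x, y)] is the gain of the arc (x, y) -> (x, y + 1).
    """
    neg_inf = float("-inf")
    best = [[neg_inf] * height for _ in range(width)]
    best[0][0] = 0.0
    # Visiting locations in row-major order is a valid topological order,
    # because every arc increases either x or y.
    for x in range(width):
        for y in range(height):
            if x > 0:
                best[x][y] = max(best[x][y],
                                 best[x - 1][y] + gain_east[(x - 1, y)])
            if y > 0:
                best[x][y] = max(best[x][y],
                                 best[x][y - 1] + gain_north[(x, y - 1)])
    return best[width - 1][height - 1]
```

Each location is visited once and has at most two incoming arcs, so the running time is proportional to the area of the rectangle spanned by S and T.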

In reality, commercial FPGA architectures have complex clustering constraints at the CLB level (such as the input constraint, clock constraint, control signal constraint, etc.) in addition to capacity constraints. Since the gain graph does not consider such complex constraints at all, we introduce the notion of a constrained gain graph.

For example, in the artificial FPGA architecture shown in Figure 12, each CLB contains two BLEs and six inputs. This simplified architecture imposes two clustering constraints: a capacity constraint of two, and an input constraint of six. Given that a source CLB S contains three BLEs, and a sink CLB T contains only one BLE, we want to find a monotone path from S to T to maximize gain while observing the clustering constraints for each CLB on the path. If we assume that x(T) > x(S) and y(T) > y(S), then each move along the monotone path is in the direction of either north or east. For an internal CLB C, there are, at most, two incoming nodes from the west, two incoming nodes from the south, and two outgoing nodes. If a CLB C is still in a legal configuration after moving in a node v_i and moving out a node v_j, we draw an edge from v_i to v_j with a proper gain value. At most, we can build a 4 × 2 bipartite graph between C and its neighbors from the west and south. For the purpose of illustration, we draw all feasible edges between CLB C and its four neighbors in Figure 12. The gain of an edge in the constrained gain graph is the reduction of the cost function when the source node of the edge is moved from the source CLB to the sink CLB. Finally, we create a pseudosource node s and a pseudosink node t.

Fig. 12. Construction of a constrained gain graph.

Once the constrained gain graph is constructed, the constrained legalization problem is transformed to a longest-path problem for a directed acyclic graph, which can be optimally solved with a complexity of O(n). Our algorithm is very flexible for handling any CLB-level clustering constraints.
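A longest path in a directed acyclic graph can be found in time linear in the number of edges by relaxing edges in topological order. The sketch below is a generic version of that step, not the SPCD code: it assumes the constrained gain graph has already been built as a list of (tail, head, gain) tuples in which infeasible move-in/move-out pairs have simply been left out, and the node names are placeholders.

```python
from collections import defaultdict, deque

def longest_path(edges, source, sink):
    """Return (total_gain, path) of a maximum gain source-to-sink path,
    or None if the sink cannot be reached."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, gain in edges:
        succ[u].append((v, gain))
        indeg[v] += 1
        nodes.update((u, v))
    queue = deque(n for n in nodes if indeg[n] == 0)
    best = {source: 0.0}
    pred = {}
    while queue:  # Kahn's algorithm yields a topological order
        u = queue.popleft()
        for v, gain in succ[u]:
            if u in best and best[u] + gain > best.get(v, float("-inf")):
                best[v] = best[u] + gain
                pred[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if sink not in best:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return best[sink], path[::-1]
```

Because legality is encoded in which edges exist, adding a new CLB-level constraint only changes graph construction and leaves the path search untouched.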

4.6 Duplication Graph and Redundancy Removal

Since we perform logic duplication at the end of each annealing iteration, two problems may arise. First, the area increase due to cloning may be substantial. Secondly, some of the duplications committed in the previous iterations may become noncritical. If we do not remove them in a timely manner, they may occupy timing-critical locations and affect future duplication processes. Therefore, before logic duplication takes place at the end of each annealing iteration, we merge duplicated copies to reduce the circuit area and maintain speed.

We introduce a data structure called a duplication graph, which is the original netlist with two modifications. First, to keep track of all the copies of the same node, we introduce the notion of a choice node whose fanin nodes are all logically equivalent. When a duplication graph is initialized, we create a choice node for each node in the netlist. Then, for each choice node c, we introduce a new net e. Assume c has k fanin nodes g_1, g_2, ..., g_k, and that each of the fanin nodes has an output net e_1, e_2, ..., e_k. The source pin of net e is choice node c, and the sink pins of net e are the union of all sink pins of e_1, e_2, ..., e_k. Figure 13 is an illustration of the duplication graph. Under choice node c_1, g′_1 is a copy of g_1. Choice node c_1 drives two gates, g_2 and g_3, which fan out to choice nodes c_2 and c_3, respectively. Choice node c_2 drives primary output po_1, and c_3 drives po_2. Initially, each choice node contains only one fanin. During the logic duplication step, the duplicated copies are added to the duplication graph incrementally.

Fig. 13. Illustration of a duplication graph.

In a duplication graph N = (C, V, E), each node c ∈ C represents a choice node, each node v ∈ V represents a logic gate, and each directed edge e = (c, v) ∈ E represents a wire connecting the output of choice node c to one input of a logic gate v. Each choice node c is a set of logic gates g ∈ c with identical functionality. For each directed edge e = (c, v) ∈ E, the arrival time at the sink pin is arr_t(v) = min over all g ∈ c of (arr_t(g) + delay(g, v)).

We formulate a global timing-constrained redundancy removal problem. Under the given timing constraints, slack(e) ≥ 0 for every edge e ∈ E. We want to find a maximum set S, and remove every v ∈ S and every edge e = (c, v) ∈ E from N such that in the new duplication graph N′ = (C, V′, E′), slack(e′) ≥ 0 for every edge e′ ∈ E′ and c ≠ ∅ for every choice node c ∈ C.

We also formulate a local redundancy removal problem under timing constraints. For a choice node c, we assume c has m fanin gates g_1 through g_m, and n fanout gates v_1 through v_n. Under the timing constraints, each fanout gate v_i has a required arrival time req_t(v_i). We want to find a maximum subset S of c and remove every g ∈ S from c such that arr_t(v_i) ≤ req_t(v_i) for all the fanout gates of c. We build an m by n matrix, and define the value matrix(i, j) at row i and column j as follows: if arr_t(v_j) ≤ req_t(v_j) when v_j is driven by g_i, matrix(i, j) = 1; otherwise matrix(i, j) = 0. To solve the local redundancy removal problem, we need to select a minimum number of rows such that every column contains at least one 1. This is a unate covering, or a minimum set covering, problem which is NP-complete. Since a minimum set covering problem can be transformed to a local redundancy removal problem, the local removal problem is NP-complete as well. If we limit m to a small constant (e.g., five) during the logic duplication, we can solve the local redundancy removal problem optimally using the reduction techniques together with a branch-and-bound algorithm.
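When m is limited to a small constant, the covering problem can even be solved by plain enumeration. The sketch below does exactly that: it tries row subsets in order of increasing size and returns the first one that covers every column. Note that this exhaustive search is a stand-in chosen for brevity; it is not the reduction plus branch-and-bound procedure mentioned above, although both return a minimum cover.

```python
from itertools import combinations

def min_row_cover(matrix):
    """Smallest list of row indices such that every column has a 1 in at
    least one selected row; None if no cover exists.

    matrix[i][j] == 1 means fanout v_j meets its required time when
    driven by gate g_i.
    """
    m = len(matrix)
    n = len(matrix[0]) if m else 0
    # Start at size 1: a choice node must keep at least one gate.
    for size in range(1, m + 1):
        for rows in combinations(range(m), size):
            if all(any(matrix[i][j] for i in rows) for j in range(n)):
                return list(rows)
    return None
```

The gates whose rows are not selected form the removable set S. With m limited to five, at most 31 subsets are examined per choice node.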

We propose a heuristic to solve the global redundancy removal problem by solving the local redundancy removal problem in a reverse topological order. During the traversal of the duplication graph from PO to PI, we optimally perform local redundancy removal for each choice node with multiple fanins.

Fig. 14. The validation flow.

After a local redundancy problem is solved, we perform duplication removal and fanout partitioning together. Then, we propagate the remaining time during the incremental timing update. We can incrementally update the required time during the redundancy removal process.

5. VALIDATION IN A COMMERCIAL FPGA IMPLEMENTATION FLOW

The Quartus University Interface Program (QUIP) kit is designed to enable university and other researchers to plug new CAD tools and ideas into Altera’s complete Quartus II CAD flow, from register transfer level (and even above) descriptions of circuits to programming files for real FPGAs. With help from QUIP, we have built the first academic flow that allows for direct comparison with the Quartus II physical implementation tools. To target a Stratix device, we modify the VPR architecture file to describe LAB fitting rules (described in detail in Section 5.2.3) and cell delays for the LUT, FF, and pads. The interconnect delay is provided by an API function from QUIP.

5.1 The Validation Flow

Figure 14 is an overview of our entire validation flow. First, we run script.algebraic in SIS [Sentovich et al. 1992], followed by Flowmap [Cong and Ding 1994], a depth-optimal mapping algorithm for LUT-based FPGAs. In order to eliminate long interconnect delays between pads and design logic (especially when a small design is fitted to a relatively large device), we intentionally insert a flip-flop (FF) after each primary input and before each primary output. Since we only measure the clock frequency of the modified netlist, the pad placement is irrelevant.

Next, for the Quartus II flow, the modified netlist is converted to an Altera format (.vqm). This is done by a dumper utility named net2vqm, distributed as part of the QUIP package. However, the original version assumes that each CLB contains only one BLE, while in the Stratix architecture each CLB contains ten BLEs. We performed extensive modifications to the utility to support multiple BLEs in a single CLB. After that, the Quartus II fitter reads in the .vqm files and performs clustering, placement, and routing sequentially. Since Quartus II is a constraint-driven optimization engine, we set an aggressive clock constraint of 3 ns. Finally, we run the Quartus II timer to report the timing result.

For our SPCD flow, we first run T-VPack [Marquardt et al. 1999] to generate an initial clustering solution. Then we perform simultaneous placement with clustering and duplication (SPCD). After that, we convert the new netlist to .vqm format and generate location constraints for all design logic. The new .vqm netlist, together with the timing/location constraints, is then given to Quartus II. The fitter honors our clustering and placement constraints and performs routing only. Finally, we run the Quartus II timer to report the maximum frequency.

5.2 Placement Engine Extension

In order to accurately model the Stratix architecture and generate valid physical constraints, we need to perform several enhancements to our placement engine.

5.2.1 Heterogeneous Resources. VPR only considers simple FPGA architectures with CLBs and pads. In contrast, commercial FPGA architectures such as Stratix and Virtex2 contain memory blocks, DSP blocks, etc. In order to generate valid physical locations for the Stratix device, we consider these resources as well. Since the MCNC benchmark circuits we use do not contain such macros, we simply mark such locations as blockages and do not use them to place design logic.

5.2.2 Delay Modeling. We need to modify VPR’s delay model to target the Stratix devices. The delay consists of two parts: a cell delay and an interconnect delay. For the cell delay, we use the value in the library file. Since the propagation delays from each input pin of a four-input LUT are different, we use an average value. For the interconnect delay, we directly call an API function get_point_to_point_delay() provided by the QUIP package.

5.2.3 LAB Fitting Rules. In the Stratix architecture, each LAB contains ten logic elements (LEs). Since the MCNC benchmark circuits we use do not have complex clock schemes, we only need to observe the following subset of LAB fitting rules. For more details, please refer to the Stratix Device Handbook [2006].

(1) For good routability, the number of distinct data inputs (excluding CIN and feedbacks) should not exceed 26;

(2) All .clk and .ena signals on a logic cell are paired to form an LE clock. In any LAB, there can be no more than two distinct LE clock pairs;

(3) A maximum of two distinct signals can be connected to the .aclr ports; and

(4) A maximum of one distinct signal can be connected to the .aload ports.
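The four rules above, together with the ten-LE capacity, lend themselves to a simple legality predicate. The sketch below is a hypothetical illustration: the dictionary layout of an LE, the key names, and the treatment of rule (1) as a hard limit are assumptions, and the "inputs" set is assumed to already exclude CIN and feedback signals.

```python
def lab_is_legal(les, max_les=10, max_inputs=26):
    """Check a candidate LAB against the subset of fitting rules listed above.

    les: list of dicts with optional keys "inputs" (iterable of signal
         names), "clk", "ena", "aclr", and "aload" (signal name or None).
    """
    if len(les) > max_les:
        return False
    # Rule (1): distinct data inputs across the whole LAB.
    data_inputs = set()
    for le in les:
        data_inputs |= set(le.get("inputs", ()))
    if len(data_inputs) > max_inputs:
        return False
    # Rule (2): at most two distinct (.clk, .ena) pairs.
    clock_pairs = {(le.get("clk"), le.get("ena"))
                   for le in les if le.get("clk") is not None}
    if len(clock_pairs) > 2:
        return False
    # Rule (3): at most two distinct .aclr signals.
    aclr = {le["aclr"] for le in les if le.get("aclr") is not None}
    if len(aclr) > 2:
        return False
    # Rule (4): at most one distinct .aload signal.
    aload = {le["aload"] for le in les if le.get("aload") is not None}
    return len(aload) <= 1
```

A predicate of this kind is what decides whether an edge may be drawn in the constrained gain graph of Section 4.5.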

6. RUNTIME/QUALITY TRADEOFF AND COMPLEXITY ANALYSIS

The runtime of our SPCD algorithm consists of two parts: a placement engine and a duplication/legalization engine. First, we analyze the complexity of the placement engine. For a given architecture, each CLB contains N BLEs, I inputs, and N outputs. In the input clustered netlist, the number of CLBs is n, and the number of BLEs is m. Therefore, n ≤ m ≤ N * n, and O(m) = O(N * n) = N * O(n). In our SPCD algorithm, we perform both block level moves and fragment level moves. At each temperature, the number of block level moves performed is n^{4/3}, and the number of fragment level moves performed is (α * m)^{4/3} ≈ (α * N * n)^{4/3}. We can choose the value of α between zero and one to achieve the runtime/quality tradeoff. As a result, the complexity of the block level move is O(n^{4/3}), and the complexity of the fragment level move is O((α * N * n)^{4/3}). In reality, the value of N is not very big, and we can always choose α to make O((α * N * n)^{4/3}) = O(n^{4/3}). Hence, the overall complexity is O(n^{4/3}). As a result, our algorithm’s complexity is similar to that of VPR, and hence quite scalable.
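A small numeric sketch makes the tradeoff concrete. It simply evaluates the two move counts given above, using the upper bound m = N * n for the number of BLEs; the function name is illustrative.

```python
def moves_per_temperature(n, N, alpha):
    """Block level and fragment level move counts at one temperature,
    using m = N * n as the number of BLEs (its upper bound)."""
    block_moves = n ** (4 / 3)
    fragment_moves = (alpha * N * n) ** (4 / 3)
    return block_moves, fragment_moves
```

For instance, with n = 1000 CLBs and N = 4, choosing α = 1/N = 0.25 makes the fragment level budget equal to the block level budget of about 10,000 moves, while α = 1 raises it by a factor of 4^{4/3}, roughly 6.3.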

Then, we analyze the complexity of the duplication/legalization engine. For each source node, the complexity of the monotone region computation is O(K). The maximum size of the monotone region is the size of the device, which is O(n). For each location in a monotone region, we need to recalculate the edge delay for all input pins of a node s and the sink pin t, and that is an O(K) operation. Since the size of the monotone region is worst-case O(n), the complexity of finding the optimal destination location is O(n), treating K as a constant. For legalization, we assume the distances between the source location and the destination location in the x and y directions are dx and dy, respectively. The complexity for constructing the gain graph is O(dx * dy * N), and the complexity for the maximum gain path algorithm is O(dx * dy). Since dx is bounded by the width of the device and dy is bounded by the height of the device, the legalization algorithm has a worst-case complexity of O(n). Also, we perform static timing analysis during the duplication, which is an O(n) operation as well. Since we limit the number of duplications to a small number (logarithmic in the circuit size) and the number of annealing iterations to another constant (∼100), the overall duplication/legalization has a complexity of O(n log n). As a result, the overall SPCD algorithm has a runtime complexity of O(n^{4/3}).

7. EXPERIMENTAL RESULTS

We implemented our SPCD algorithm under the VPR framework. For purposes of comparison, we downloaded the VPR 4.3 source code, architecture file, and the complete set of 20 MCNC benchmark circuits from the FPGA Place-and-Route Challenge [2006]. We modified the architecture file to specify the number of BLEs contained in a single CLB. We ran all 20 MCNC circuits through the commonly used academic FPGA design flow shown in Figure 15. We first ran script.algebraic in SIS [Sentovich et al. 1992], followed by Flowmap [Cong and Ding 1994]. Then we ran T-VPack [Marquardt et al. 1999] to generate an initial clustering solution. This initial clustering was then given to both timing-driven VPR and SPCD to perform placement. Except in Section 7.1, we always compare our results with the timing-driven VPR. Our SPCD algorithm has several different modes: SPC, SPD, and SPCD. SPC performs clustering during placement; SPD performs logic duplication during placement; and SPCD performs both clustering and duplication during placement. Furthermore, SPD has three options: SPD-0 has zero duplication, and it is essentially the same as VPR; SPD-1 performs logic duplication only once after the full placement is obtained; SPD-m performs simultaneous logic duplication and placement optimization. SPCD with path counting utilizes a path counting-based net weighting scheme. In Sections 7.1 to 7.4, we conduct experiments against VPR on the default architecture described in Section 3. Finally, in Section 7.5 we also compare SPCD with the Quartus II 4.0 implementation tool on the Stratix architecture.

Fig. 15. The experimental flow.

7.1 Wirelength Comparison

Since duplication does not improve the wirelength, we compare our algorithm SPC (no duplication) with VPR in Table III using the total weighted half-bounding-box wirelength as the only optimization objective. The weights for nets of different sizes can be found in Chen and Cong [2004]. When we combine clustering with placement, we can outperform VPR by 27%, on average.

In Figure 16 we illustrate the impact of CLB size (N) on the wirelength improvement obtained from SPC. When we change the CLB size from two to ten, the wirelength gap between SPC and T-VPack + VPR increases monotonically from 15% to 36%. The result shows that as the size of the CLB increases, it is more and more difficult to generate a good clustering solution with small wirelength without physical information. Since SPC explores different clustering solutions during the placement stage, it generates clustering and placement solutions with much shorter wirelength.

Table III. Wirelength Improvement of SPC (N = 4)

Circuit     VPR        SPC        Improvement
ex5p        112.47     92.7707    17.52%
apex4       113.639    94.2218    17.09%
misex3      123.616    99.2435    19.72%
tseng       94.9456    61.6671    35.05%
alu4        123.03     95.3293    22.52%
dsip        195.544    94.8918    51.47%
seq         173.641    135.756    21.82%
diffeq      132.271    88.7259    32.92%
apex2       190.324    151.032    20.64%
s298        166.899    140.351    15.91%
des         278.122    210.536    24.30%
bigkey      171.986    155.525    9.57%
spla        426.227    324.999    23.75%
elliptic    359.011    228.558    36.34%
ex1010      463.618    341.405    26.36%
pdc         704.286    545.51     22.54%
frisc       584.732    432.13     26.10%
s38584.1    576.457    321.653    44.20%
s38417      696.701    424.874    39.02%
clma        1701.02    1169.64    31.24%
Average                           26.90%

Fig. 16. Impact of CLB size on wirelength improvement of SPC.

7.2 Timing Comparison

7.2.1 Impact of Clustering. In Table IV we compare SPC with both VPR and the path counting approach (Path) [Kong 2002] in timing optimization. If we use only the path counting-based net weighting scheme in SPC, we can outperform VPR by 12% (column 4); if we perform only clustering optimization in SPC, we can outperform VPR by 16% (column 6); if we integrate the path counting-based net weighting scheme with the clustering optimization, SPC outperforms the original VPR result by 25% (column 8). However, the wirelength reduction obtained by SPC in its timing mode drops from 27% to about 15% when compared with VPR.

Table IV. Timing Improvement of SPC (N = 4)

Circuit     VPR       Path      %          SPC       %          SPC + Path   %
ex5p        50.45     43.14     16.94%     42.71     18.14%     40.75        23.80%
apex4       47.44     41.96     13.07%     46.96     1.04%      41.44        14.49%
misex3      51.04     44.81     13.91%     41.00     24.49%     38.53        32.47%
tseng       38.85     36.15     7.48%      32.77     18.55%     35.11        10.65%
alu4        53.16     44.12     20.50%     46.85     13.46%     42.50        25.07%
dsip        38.32     38.15     0.44%      34.30     11.73%     40.12        −4.49%
seq         51.26     46.62     9.97%      44.46     15.29%     42.90        19.51%
diffeq      47.73     43.52     9.69%      37.49     27.30%     41.15        16.01%
apex2       56.36     54.94     2.60%      52.16     8.07%      47.23        19.34%
s298        87.36     84.63     3.23%      88.40     −1.17%     80.98        7.88%
des         83.88     75.42     11.20%     76.85     9.14%      65.44        28.18%
bigkey      41.37     43.29     −4.44%     41.56     −0.47%     41.35        0.03%
spla        72.47     63.86     13.48%     66.51     8.95%      58.27        24.35%
elliptic    71.07     54.47     30.48%     76.94     −7.63%     48.49        46.58%
ex1010      97.88     80.64     21.38%     85.16     14.93%     74.82        30.81%
pdc         113.15    77.93     45.18%     79.43     42.45%     67.60        67.38%
frisc       81.39     93.16     −12.64%    77.27     5.33%      75.73        7.47%
s38584      64.37     61.63     4.45%      45.44     41.67%     47.78        34.71%
s38417      76.63     71.94     6.52%      48.45     58.17%     49.89        53.61%
clma        137.20    114.48    19.85%     125.03    9.73%      102.0        34.52%
Average                         11.66%               15.96%                  24.62%

Fig. 17. Impact of CLB size on the timing improvement of SPC.

In Figure 17 we illustrate the impact of CLB size on the timing improvement obtained from SPC with path counting. When the CLB size is two, the timing gap between SPC and T-VPack + VPR is 17%. When the CLB size increases from four to ten, the gap remains in a narrow range between 22% and 25%. The result shows that even when the CLB size is relatively small (two or four), it is difficult to generate a good clustering solution with small delay without physical information. Since SPC explores different clustering solutions during the placement stage, it generates clustering and placement solutions with much better delay.

In Table V we illustrate the performance of SPC with path counting ona different routing architecture. For the results in Figure 17, we use the

ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 3, July 2006.

Page 24: Simultaneous Placement with Clustering and Duplicationece506/project/2014/... · Simultaneous Placement with Clustering and Duplication • 745 use the same architecture as in Figure

Simultaneous Placement with Clustering and Duplication • 763

Table V. Impact of Routing Architecture onTiming Improvement of SPC (N = 4)

Circuit VPR SPC Improvementex5p 33.68 31.60 6.58%apex4 31.74 31.29 1.45%misex3 31.05 28.05 10.67%Tseng 44.56 37.12 20.03%alu4 34.32 32.29 6.30%dsip 16.32 17.16 −4.90%seq 31.67 29.98 5.64%diffeq 45.65 41.63 9.64%apex2 38.45 34.84 10.36%s298 61.39 61.92 −0.85%des 32.38 27.26 18.81%bigkey 21.64 23.15 −6.51%spla 45.03 40.25 11.88%elliptic 44.47 42.32 5.09%ex1010 53.74 47.96 12.05%pdc 58.94 44.82 31.51%frisc 70.03 67.71 3.43%s38584.1 31.53 35.53 −11.25%s38417 47.84 41.37 15.66%clma 79.07 66.25 19.35%Average 8.25%

Table VI. Timing Improvement of SPD (N = 4)

SPD-m w/Circuit VPR SPD-0 SPD-1 % SPD-m % path counting %ex5p 50.45 51.66 47.76 8.2% 46.80 7.80% 42.31 19.24%apex4 47.44 48.30 46.40 4.1% 43.86 8.16% 40.2 18.01%misex3 51.04 48.94 45.92 6.6% 42.14 21.12% 37.85 34.85%tseng 38.85 38.55 34.80 10.8% 28.71 35.32% 29.29 32.64%alu4 53.16 54.85 52.45 4.6% 43.56 22.04% 43.44 22.38%dsip 38.32 38.92 38.92 0.0% 41.37 −7.37% 33.35 14.90%seq 51.26 53.29 49.44 7.8% 43.56 17.68% 42.4 20.90%diffeq 47.73 45.26 40.93 10.6% 36.53 30.66% 38.57 23.75%apex2 56.36 58.75 55.05 6.7% 53.58 5.19% 45.32 24.36%s298 87.36 82.70 75.26 9.9% 80.56 8.44% 88.4 −1.18%des 83.88 81.71 73.60 11.0% 71.48 17.35% 63.29 32.53%bigkey 41.37 41.51 40.74 1.9% 38.18 8.36% 36.46 13.47%spla 72.47 72.09 66.45 8.5% 63.72 13.73% 65.64 10.41%elliptic 71.07 64.48 62.32 3.5% 59.89 18.67% 52.27 35.97%ex1010 97.88 95.24 87.06 9.4% 87.82 11.46% 71.59 36.72%pdc 113.15 95.89 87.51 9.6% 82.87 36.54% 69.68 62.39%frisc 81.39 83.81 76.02 10.3% 75.51 7.79% 79.33 2.60%s38584.1 64.37 53.98 52.16 3.5% 46.72 37.78% 48.82 31.85%s38417 76.63 76.84 70.18 9.5% 53.29 43.80% 55.2 38.82%clma 137.20 136.56 123.93 10.2% 106.92 28.32% 99.39 38.04%Average 7.32% 18.64% 25.63%

ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 3, July 2006.

Page 25: Simultaneous Placement with Clustering and Duplicationece506/project/2014/... · Simultaneous Placement with Clustering and Duplication • 745 use the same architecture as in Figure

764 • G. Chen and J. Cong

Fig. 18. Impact of CLB size on timing improvement of SPD-1.

default routing architecture obtained from the FPGA Place-and-Route Challenge [2006], in which routing segments have a length of one and all routing switches are tristate buffers. Since interconnect delays are very sensitive to distance in this default architecture, the placement algorithms are vital to design performance. In Table V we try a different routing architecture; here, routing segments have a length of four and are connected to both tristate buffers and pass transistors. Interconnect delays in this architecture are much less sensitive to distance compared to the default routing architecture. The timing improvement obtained from SPC on this architecture is only 8%. For a realistic architecture with routing segments of multiple lengths (1, 4, ...), the delay improvement should fall somewhere between 8% and 25% (please refer to the delay comparison using the Stratix architecture shown in Table XII).
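The percentages reported in these tables appear consistent, to within rounding, with a gain measured relative to the new delay, (baseline − new) / new, averaged arithmetically over the circuits. The following is a minimal Python sketch of that bookkeeping, checked against three rows of Table V. The function name is illustrative, and the formula is inferred from the tabulated numbers rather than stated in the text.

```python
from statistics import mean


def improvement_pct(baseline_delay, new_delay):
    """Percentage gain of new_delay over baseline_delay, relative to new_delay.

    Assumption: this convention is inferred from the tabulated values,
    not taken from a definition in the paper.
    """
    return 100.0 * (baseline_delay - new_delay) / new_delay


# (circuit, VPR delay, SPC delay) for three rows of Table V.
rows = [("ex5p", 33.68, 31.60), ("dsip", 16.32, 17.16), ("pdc", 58.94, 44.82)]
for name, vpr_delay, spc_delay in rows:
    print(f"{name}: {improvement_pct(vpr_delay, spc_delay):.2f}%")

# Per-circuit improvements exactly as listed in Table V (20 circuits).
table_v_pct = [
    6.58, 1.45, 10.67, 20.03, 6.30, -4.90, 5.64, 9.64, 10.36, -0.85,
    18.81, -6.51, 11.88, 5.09, 12.05, 31.51, 3.43, -11.25, 15.66, 19.35,
]
average_pct = mean(table_v_pct)
print(f"average: {average_pct:.2f}%")
```

Under this convention a negative value means the new flow is slower than the baseline on that circuit, as with dsip above.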

7.2.2 Impact of Logic Duplication. In Table VI, we show the impact of logic duplication on timing using the default architecture. Column 3 is the result of SPD-0, which is our implementation of VPR without any logic duplication. The result of SPD-0 is, in general, similar to that of VPR. In column 4 we perform duplication/legalization only once after full placement (SPD-1), and we achieve, on average, around a 7% timing improvement. In column 6 we perform simultaneous logic duplication and placement optimization (SPD-m), and we outperform VPR by 19%. In column 8 we integrate the path counting-based net weighting scheme with the duplication optimization, and SPD-m with path counting significantly outperforms the original VPR result by 26%.

In Figure 18, we illustrate the impact of CLB size on the performance of SPD-1. When the CLB size is one, the timing improvement obtained from duplication is 5%. When the CLB size increases from two to ten, the timing improvement remains in a narrow range between 7 and 9%. The result shows that when the CLB size is greater than one, there is more room for duplication since the delay between BLEs within the same CLB is normally smaller than the delay between different CLBs.

In Figure 19, we illustrate the impact of CLB size on the performance of SPD-m with path counting. When the CLB size is one, the performance gap


Fig. 19. Impact of CLB size on timing improvement of SPD-m.

Table VII. Timing Comparisons between SPD, SPC and SPCD

Circuit   SPD-m     SPD-m + Path   SPC       SPC + Path   SPD-m + SPC + Path
ex5p      7.80%     19.24%         18.14%    23.80%       21.76%
apex4     8.16%     18.01%         1.04%     14.49%       30.23%
misex3    21.12%    34.85%         24.49%    32.47%       36.02%
tseng     35.32%    32.64%         18.55%    10.65%       21.99%
alu4      22.04%    22.38%         13.46%    25.07%       23.34%
dsip      −7.37%    14.90%         11.73%    −4.49%       −5.64%
seq       17.68%    20.90%         15.29%    19.51%       27.16%
diffeq    30.66%    23.75%         27.30%    16.01%       39.33%
apex2     5.19%     24.36%         8.07%     19.34%       29.39%
s298      8.44%     −1.18%         −1.17%    7.88%        16.08%
des       17.35%    32.53%         9.14%     28.18%       37.07%
bigkey    8.36%     13.47%         −0.47%    0.03%        31.80%
spla      13.73%    10.41%         8.95%     24.35%       37.20%
elliptic  18.67%    35.97%         −7.63%    46.58%       19.89%
ex1010    11.46%    36.72%         14.93%    30.81%       49.51%
pdc       36.54%    62.39%         42.45%    67.38%       62.86%
frisc     7.79%     2.60%          5.33%     7.47%        16.30%
s38584    37.78%    31.85%         41.67%    34.71%       30.41%
s38417    43.80%    38.82%         58.17%    53.61%       48.21%
clma      28.32%    38.04%         9.73%     34.52%       47.18%
Average   18.64%    25.63%         15.96%    24.62%       31.01%

between SPD-m and T-VPack+VPR is 18%. When the CLB size increases from two to ten, the timing gap between SPD-m and T-VPack+VPR gradually increases from 21 to 27%. The result shows that even when the CLB size is relatively small (one or two), integrating duplication with placement has a great impact on circuit performance.

7.2.3 Comparison of SPC, SPD, and SPCD. As shown in Table VII, without the path counting-based net weighting scheme, SPD-m outperforms SPC by a few percentage points; with path counting, both SPD-m and SPC achieve a similar improvement of 25–26%; when all three techniques are combined, SPCD significantly outperforms T-VPack+VPR by 31%.


Table VIII. Effect of α on Timing (CLB = 4)

          α = 0.25                  α = 0.50                  α = 1.0
          Timing       Runtime      Timing       Runtime      Timing       Runtime
Circuit   Improvement  Ratio        Improvement  Ratio        Improvement  Ratio
des       24.47%       26.15%       28.29%       38.13%       32.64%       70.28%
bigkey    −10.20%      30.42%       16.00%       42.59%       7.18%        72.58%
spla      27.09%       41.17%       34.99%       49.39%       28.57%       76.59%
elliptic  51.06%       42.89%       48.61%       51.12%       49.63%       73.86%
ex1010    29.69%       36.08%       31.24%       41.53%       31.73%       64.08%
pdc       58.75%       32.75%       69.24%       39.72%       86.41%       58.41%
frisc     0.66%        33.61%       −0.72%       40.23%       7.33%        60.61%
s38584.1  43.58%       26.62%       47.86%       32.53%       34.81%       47.41%
s38417    27.05%       32.01%       53.17%       37.47%       60.67%       59.77%
clma      27.80%       25.82%       38.38%       31.75%       41.08%       48.29%
Average   21.59%       32.75%       31.18%       40.45%       30.83%       63.19%

Table IX. Effect of α on Timing (CLB = 10)

          α = 0.25                  α = 0.50                  α = 1.0
          Timing       Runtime      Timing       Runtime      Timing       Runtime
Circuit   Improvement  Ratio        Improvement  Ratio        Improvement  Ratio
des       16.59%       24.53%       19.22%       36.35%       20.88%       71.59%
bigkey    −4.34%       36.24%       −9.78%       54.49%       2.95%        98.44%
spla      33.13%       61.99%       43.21%       74.56%       40.32%       121.37%
elliptic  50.30%       45.57%       54.03%       47.84%       52.97%       72.16%
ex1010    38.80%       54.47%       26.00%       61.75%       32.12%       98.68%
pdc       49.99%       50.79%       52.58%       60.38%       54.95%       95.26%
frisc     −2.29%       46.78%       10.42%       57.29%       14.86%       88.07%
s38584.1  30.30%       31.73%       39.66%       33.08%       39.01%       52.39%
s38417    22.18%       43.91%       41.60%       50.76%       40.63%       81.65%
clma      54.89%       28.85%       71.51%       36.69%       83.96%       58.08%
Average   20.18%       42.49%       27.05%       51.32%       31.25%       83.77%

Furthermore, we observe that the timing improvement of path counting is orthogonal to that of clustering and duplication, but the timing improvements of clustering and duplication overlap with one another. When we perform logic duplication, we both duplicate and relocate cells to new CLBs, thus changing the clustering structure dramatically. Therefore, clustering and duplication share a large portion of their solution space.

7.3 Runtime Comparison

7.3.1 Runtime/Quality Tradeoff of SPC. In the previous experiments in Sections 7.1 and 7.2.1 we performed $m^{1.33} \approx (N \cdot n)^{1.33}$ fragment moves and zero block moves. In this section, we fix the number of block moves to $n^{1.33}$, and set the number of fragment moves to $(\alpha \cdot m)^{1.33} \approx (\alpha \cdot N \cdot n)^{1.33}$, where α is between zero and one. In Table VIII we show the impact of α on the runtime/quality tradeoff of SPC. It is no surprise that when α increases, that is, the number of fragment moves increases, the timing improvement increases from 22% to 31%. Our runtime is generally shorter than VPR’s because the number of block moves we perform is only 10% of VPR’s. If we reduce the number of block moves VPR performs to the same as SPC, it yields about 5% worse results (both in timing and


Table X. Area and Runtime Comparison of SPD

          SPD-0               SPD-1                                   SPD-m
Circuit   Area    Runtime     Area    %       Runtime   %         Area    %       Runtime   %
ex5p      1274    65.344      1274    0.00%   65.31     −0.05%    1311    2.90%   62.81     −3.87%
apex4     1319    65.234      1320    0.08%   65.63     0.60%     1337    1.36%   72.30     10.83%
misex3    1529    86.516      1535    0.39%   86.67     0.18%     1534    0.33%   75.53     −12.70%
tseng     1473    87.516      1476    0.20%   87.00     −0.59%    1481    0.54%   95.08     8.64%
alu4      1630    81.953      1630    0.00%   81.89     −0.08%    1631    0.06%   79.78     −2.65%
dsip      2045    115.828     2045    0.00%   116.75    0.80%     2045    0.00%   116.86    0.89%
seq       2029    134.922     2030    0.05%   137.89    2.20%     2033    0.20%   127.06    −5.82%
diffeq    2036    130.391     2036    0.00%   131.02    0.48%     2039    0.15%   123.03    −5.64%
apex2     2159    149.281     2160    0.05%   152.36    2.06%     2165    0.28%   156.48    4.83%
s298      2558    146.406     2558    0.00%   148.33    1.31%     2559    0.04%   165.83    13.27%
des       2673    203.5       2673    0.00%   203.86    0.18%     2673    0.00%   206.41    1.43%
bigkey    3361    239.516     3361    0.00%   241.52    0.83%     3361    0.00%   242.44    1.22%
spla      3999    371.922     4004    0.13%   373.81    0.51%     4036    0.93%   385.83    3.74%
elliptic  4430    473.188     4476    1.04%   475.92    0.58%     4448    0.41%   484.30    2.35%
ex1010    4740    444.25      4740    0.00%   445.86    0.36%     4743    0.06%   438.41    −1.32%
pdc       5672    664.532     5674    0.04%   666.28    0.26%     5678    0.11%   638.47    −3.92%
frisc     6061    617.265     6061    0.00%   620.28    0.49%     6074    0.21%   665.47    7.81%
s38584.1  7375    819.859     7380    0.07%   822.74    0.35%     7375    0.00%   831.14    1.38%
s38417    8589    993.36      8591    0.02%   996.30    0.30%     8604    0.17%   1105.92   11.33%
clma      13673   2214.188    13674   0.01%   2208.86   −0.24%    13688   0.11%   2415.09   9.07%
Average                               0.10%             0.53%             0.39%             2.04%

Table XI. Routed Delay and Track Count Comparison

          VPR                SPC                                      SPD-m
          Routed             Routed                                   Routed
Circuit   Delay    #Tracks   Delay    %         #Tracks  %         Delay    %         #Tracks  %
ex5p      52.38    646       45.47    15.20%    627      3.03%     46.15    13.50%    798      −19.05%
apex4     55.93    627       46.32    20.75%    665      −5.71%    51.72    8.14%     703      −10.81%
misex3    56.56    588       40.92    38.22%    588      0.00%     44.50    27.10%    714      −17.65%
tseng     41.08    483       36.35    13.01%    437      10.53%    33.92    21.11%    506      −4.55%
alu4      55.16    594       47.47    16.20%    506      17.39%    47.12    17.06%    616      −3.57%
dsip      38.80    935       35.73    8.59%     660      41.67%    35.06    10.67%    935      0.00%
seq       58.13    744       49.12    18.34%    768      −3.13%    54.51    6.64%     792      −6.06%
diffeq    50.41    506       39.57    27.39%    506      0.00%     40.67    23.95%    552      −8.33%
apex2     58.00    775       48.03    20.76%    725      6.90%     52.54    10.39%    875      −11.43%
s298      103.69   648       89.43    15.95%    621      4.35%     82.76    25.29%    918      −29.41%
des       88.32    960       69.26    27.52%    832      15.38%    70.76    24.82%    1024     −6.25%
bigkey    42.60    495       48.40    −11.98%   550      −10.00%   39.59    7.60%     495      0.00%
spla      78.65    1452      67.68    16.21%    1287     12.82%    64.12    22.66%    1683     −13.73%
elliptic  75.16    1156      62.14    20.95%    1054     9.68%     71.45    5.19%     1326     −12.82%
ex1010    102.88   1188      81.58    26.11%    1008     17.86%    78.95    30.31%    1512     −21.43%
pdc       125.46   2028      93.18    34.64%    1755     15.56%    89.84    39.65%    2301     −11.86%
frisc     87.64    1560      127.02   −31.00%   1600     −2.50%    85.90    2.03%     1760     −11.36%
s38584.1  66.41    1276      47.02    41.24%    924      38.10%    51.21    29.68%    1364     −6.45%
s38417    81.54    1410      54.15    50.58%    1128     25.00%    64.91    25.62%    1504     −6.25%
clma      144.02   2760      124.14   16.01%    2040     35.29%    125.73   14.55%    3120     −11.54%
Average                               19.23%             11.61%             18.30%             −10.63%


Table XII. Placement Estimated Delay Comparison on Stratix

Circuit   VPR     Path    %        SPC     %        SPD-m   %        SPCD    %
ex5p      8.26    8.20    0.82%    7.91    4.44%    7.66    7.86%    7.34    12.54%
apex4     8.04    7.61    5.69%    7.81    2.94%    7.27    10.65%   7.48    7.55%
misex3    7.64    7.54    1.37%    7.15    6.84%    6.90    10.73%   6.61    15.53%
tseng     10.97   11.73   −6.41%   8.90    23.33%   10.59   3.59%    8.68    26.44%
alu4      8.45    8.26    2.26%    7.38    14.46%   7.23    16.78%   7.31    15.58%
dsip      4.92    4.53    8.64%    3.83    28.57%   4.25    15.66%   3.83    28.57%
seq       8.12    7.76    4.71%    7.34    10.61%   7.39    9.85%    7.34    10.61%
diffeq    10.58   10.53   0.41%    10.06   5.10%    8.74    21.01%   9.56    10.59%
apex2     9.56    9.23    3.56%    9.22    3.64%    7.82    22.25%   8.40    13.78%
s298      19.54   18.74   4.23%    17.75   10.08%   14.62   33.61%   16.12   21.17%
des       7.58    7.20    5.27%    6.91    9.67%    7.29    3.87%    6.45    17.42%
bigkey    5.46    5.26    3.69%    4.57    19.51%   4.78    14.12%   4.31    26.67%
spla      10.94   10.11   8.20%    9.85    10.99%   9.27    17.97%   9.22    18.67%
elliptic  14.85   12.53   18.59%   11.27   31.77%   11.49   29.33%   10.60   40.17%
ex1010    12.14   10.64   14.08%   10.95   10.83%   9.43    28.75%   10.58   14.74%
pdc       12.84   11.56   11.03%   10.70   19.94%   10.06   27.56%   9.89    29.87%
frisc     20.79   19.45   6.89%    16.67   24.74%   15.90   30.81%   15.85   31.16%
s38584    9.27    8.37    10.84%   8.33    11.40%   7.47    24.08%   7.89    17.59%
s38417    13.84   13.62   1.63%    10.88   27.17%   11.27   22.77%   10.29   34.52%
clma      19.73   17.98   9.72%    15.19   29.87%   14.73   33.93%   14.62   34.98%
Average                   5.76%            15.29%           19.26%           21.41%

Table XIII. Routed Delay Comparison on Stratix

Circuit   Quartus  VPR     %         Path    %         SPC     %        SPD-m   %         SPCD    %
ex5p      7.81     8.32    −6.45%    8.12    −3.86%    8.22    −5.14%   8.57    −9.65%    7.62    2.53%
apex4     7.56     8.19    −8.40%    7.81    −3.38%    7.97    −5.53%   8.06    −6.70%    7.47    1.08%
misex3    7.32     8.03    −9.60%    7.54    −2.99%    7.47    −1.99%   7.52    −2.71%    7.57    −3.32%
tseng     9.17     10.38   −13.17%   10.49   −14.39%   9.01    1.69%    10.11   −10.23%   8.96    2.34%
alu4      7.51     8.68    −15.51%   8.35    −11.11%   7.72    −2.76%   7.80    −3.83%    7.60    −1.09%
dsip      4.44     5.10    −14.76%   4.73    −6.48%    4.37    1.71%    4.41    0.63%     4.12    7.84%
seq       7.19     8.32    −15.68%   8.27    −14.98%   7.54    −4.83%   7.91    −9.97%    7.47    −3.75%
diffeq    10.51    10.80   −2.73%    10.78   −2.57%    10.57   −0.55%   10.01   4.76%     9.70    8.34%
apex2     8.81     9.66    −9.68%    9.87    −11.97%   9.09    −3.16%   8.82    −0.10%    8.45    4.29%
s298      17.29    20.64   −19.36%   19.33   −11.77%   16.75   3.12%    16.70   3.44%     16.08   7.52%
des       6.07     7.32    −20.56%   6.80    −12.08%   6.63    −9.30%   7.20    −18.63%   6.31    −3.74%
bigkey    5.47     6.04    −10.46%   5.51    −0.77%    4.66    14.79%   5.08    7.20%     4.72    15.91%
spla      10.44    11.49   −10.03%   10.62   −1.70%    10.03   3.99%    10.40   0.44%     9.60    8.80%
elliptic  11.36    15.13   −33.18%   12.39   −9.06%    11.20   1.42%    11.41   −0.46%    10.58   7.41%
ex1010    10.97    12.18   −11.05%   11.28   −2.80%    11.06   −0.76%   10.67   2.71%     10.91   0.54%
pdc       12.13    13.25   −9.18%    12.73   −4.89%    11.40   6.07%    12.13   0.05%     11.09   9.46%
frisc     15.67    19.58   −24.92%   18.97   −21.04%   16.04   −2.33%   15.90   −1.46%    15.38   1.90%
s38584    8.25     9.62    −16.61%   9.24    −11.95%   8.51    −3.11%   8.12    1.65%     8.04    2.64%
s38417    11.49    14.49   −26.06%   14.29   −24.39%   11.20   2.51%    11.95   −4.00%    11.41   0.75%
clma      16.01    20.35   −27.06%   17.83   −11.34%   15.59   2.65%    16.19   −1.10%    14.85   7.80%
Average                    −15.22%           −9.17%            −0.07%           −2.40%            3.86%


Fig. 20. Critical path of Quartus II result (bigkey). (From Stratix Device Family Data Sheet, v3.2, July 2005, © ALTERA 2005.)

wirelength) and consumes 15% of standard VPR’s runtime. When α = 0.25, SPC uses 33% of standard VPR’s runtime. SPC’s runtime increases up to 63% as α increases to one. Table IX shows a similar trend when the size of the CLB is ten.
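To make the move budget of Section 7.3.1 concrete, here is a small Python sketch that evaluates the two expressions for a given α. It assumes that n is the number of CLBs and N is the CLB size, so that m ≈ N·n counts the fragments. The function and parameter names are hypothetical and are not taken from the SPCD package.

```python
def move_budget(n, N, alpha):
    """Return (block_moves, fragment_moves) for the budget in Section 7.3.1.

    Assumptions (not confirmed by the text): n is the number of CLBs,
    N is the CLB size, and m is approximated by N * n.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between zero and one")
    block_moves = n ** 1.33                      # fixed at n^1.33
    fragment_moves = (alpha * N * n) ** 1.33     # (alpha * m)^1.33
    return block_moves, fragment_moves


# Example: 100 CLBs of size 4, for the three alpha values of Table VIII.
for alpha in (0.25, 0.50, 1.0):
    blocks, fragments = move_budget(100, 4, alpha)
    print(f"alpha={alpha}: block moves ~{blocks:.0f}, fragment moves ~{fragments:.0f}")
```

Because of the 1.33 exponent, the fragment budget shrinks faster than linearly in α: quartering α cuts it to roughly 16% of the α = 1 budget, not 25%.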

7.3.2 Area and Runtime Increase of SPD. In Table X we show the area (in terms of the number of CLBs) and runtime overhead of our SPD algorithm. Regardless of the number of iterations of logic duplication we performed, the area increase by both SPD-1 and SPD-m was very small, normally less than 1%. When we performed only postplacement logic duplication in SPD-1, the runtime increase was negligible; even when we performed multiple iterations of logic duplication in SPD-m, the average runtime increase remained very small at 2%. We expected more runtime increase for SPD-m, but this was not the case. Our analysis of the annealing process reveals that logic duplication helps the placement reach a local minimum faster, so SPD-m uses a smaller number of annealing iterations than does SPD-0, in general.

7.4 Routed Results

In Table XI we show the comparison of the routed delay and track count between T-VPack + VPR, SPC, and SPD-m using the default architecture. SPC outperforms T-VPack + VPR by 19% on average in routed delay, and the reduction


Fig. 21. Critical path of SPCD result (bigkey). (From Stratix Device Family Data Sheet, v3.2, July 2005, © ALTERA 2005.)

in routed tracks is 12% on average. This is consistent with the estimated delay/wirelength reduction after placement. SPD-m outperforms T-VPack + VPR by 18% on average in routed delay, with an average increase in routed tracks of 11%. The increase in the number of routed tracks is due to the increase in the number of nodes and nets introduced by logic duplication. Note that the routed delay reduction is smaller than the estimated placement delay reduction, which is probably due to the intrinsic inaccuracy of the delay model used by placement.

7.5 Comparison with Quartus II 4.0

7.5.1 Placement Estimated Delay Comparison. Table XII shows the comparison of placement estimated delays of different algorithms under the Stratix delay model, which is based on a much more complex segmented routing architecture. Note that in this section, the results obtained by “VPR” are actually from SPD-0 instead of the original VPR. This is because the original VPR cannot model the Stratix device properly. As we mentioned in Section 7.2.2, our implementation SPD-0 generates results that are similar to those produced by the original VPR.

In Table XII, if we use the path counting-based net weighting scheme, we can outperform VPR by 6% (column 4); if we perform simultaneous placement with


clustering, we can outperform VPR by 15% (column 6); if we perform simultaneous placement with duplication, we can outperform VPR by 19% (column 8); if we perform simultaneous placement with both clustering and duplication, we can outperform VPR by 21% (column 10). Table XII shows that our unified synthesis and placement tool SPCD significantly outperforms VPR on a widely used commercial architecture, as well as on simplified academic architectures (Table VII).

7.5.2 Routed Delay Comparison. Table XIII shows the comparison of the routed delays of different algorithms reported by the Quartus timer. The timing results of VPR lose to Quartus II by 15% (column 4). If we use the path counting-based net weighting scheme, we lose to Quartus II by 9% (column 6); if we perform simultaneous placement with clustering, SPC achieves delay results similar to Quartus II (column 8); if we perform simultaneous placement with duplication, SPD-m loses to Quartus II by 2% (column 10); if we perform simultaneous placement with both clustering and duplication, SPCD outperforms Quartus II by 4% (column 12).

For example, we ran the circuit bigkey using both the standard Quartus II flow and our SPCD Stratix flow. As shown in Figure 20, a critical path of 5.469 ns can be obtained from the Quartus II flow. As shown in Figure 21, a critical path of 4.928 ns can be obtained from the SPCD flow. The cells on the critical path in Figure 21 are placed closer together and the delay improvement is 11%.
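As a quick check of the 11% figure, and assuming the improvement is measured relative to the SPCD delay (which is what the quoted numbers suggest, not something the text states), the arithmetic is:

```latex
\[
\frac{5.469\,\mathrm{ns} - 4.928\,\mathrm{ns}}{4.928\,\mathrm{ns}}
  = \frac{0.541}{4.928} \approx 0.110 = 11.0\%
\]
```

Measured against the Quartus II delay instead, the same difference would come to 0.541 / 5.469 ≈ 9.9%.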

8. CONCLUSIONS

We introduce an efficient and effective algorithm for simultaneous placement with clustering and duplication. By integrating novel techniques such as path counting-based net weighting, a fragment level move, simultaneous logic duplication during placement, monotone region-based global path monotonicity optimization, optimal legalization under complex constraints, duplication graph representation, and redundancy removal, our new SPCD algorithm produces excellent results for both wirelength and timing optimization. When compared to a widely used separate academic FPGA design flow, T-VPack+VPR, across different architectures, our algorithm improves up to 36% in wirelength and 31% in the longest path delay, with less than 1% increase in area. Although we test our algorithm in the context of FPGAs, the duplication and placement algorithms apply directly to ASICs and other architectures as well. The SPCD package is available for download from http://ballade.cs.ucla.edu/~chg/spcd.

REFERENCES

BERAUDO, G. AND LILLIS, J. 2003. Timing optimization of FPGA placements by logic replication. In Proceedings of the ACM/IEEE Design Automation Conference, 196–201.

BETZ, V. AND ROSE, J. 1997. VPR: A new packing, placement and routing tool for FPGA research. In Proceedings of the International Workshop on Field Programmable Logic and Application, 213–222.

BOZORGZADEH, E., OGRENCI, S., AND SARRAFZADEH, M. 2001. Routability-Driven packing for cluster-based FPGAs. In Proceedings of the Asia and South Pacific Design Automation Conference (Yokohama, Japan). 629–634.


CHEN, G. AND CONG, J. 2004. Simultaneous timing driven clustering and placement for FPGAs. In Proceedings of the International Conference on Field Programmable Logic and Its Applications (Antwerp, Belgium). 158–167.

CHEN, G. AND CONG, J. 2005. Simultaneous timing-driven placement and duplication. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, Calif.). 51–59.

CONG, J. AND DING, Y. 1994. FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Trans. Comput.-Aided Des. 13, 1 (Jan.), 1–12.

CONG, J., LI, H., AND WU, C. 1999. Simultaneous circuit partitioning/clustering with retiming for performance optimization. In Proceedings of the 36th ACM/IEEE Design Automation Conference (New Orleans, La.). 460–465.

HRKIC, M., LILLIS, J., AND BERAUDO, G. 2004. An approach to placement-coupled logic replication. In Proceedings of the ACM/IEEE Design Automation Conference (San Diego, Calif.). 711–716.

HUR, S.-W. AND LILLIS, J. 2000. Mongrel: Hybrid techniques for standard cell placement. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (San Jose, Calif.). 165–170.

KONG, T. 2002. A novel net weighting algorithm for timing-driven placement. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (San Jose, Calif.). 172–176.

LILLIS, J., CHENG, C.-K., AND LIN, T.-T. Y. 1996. Algorithms for optimal introduction of redundant logic for timing and area optimization. In Proceedings of the IEEE International Symposium on Circuits and Systems, 196–201.

MARQUARDT, A., BETZ, V., AND ROSE, J. 1999. Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, Calif.). 37–46.

MARQUARDT, A., BETZ, V., AND ROSE, J. 2000. Timing-driven placement for FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, Calif.). 203–213.

NEUMANN, I., STOFFEL, D., HARTJE, H., AND KUNZ, W. 1999. Cell replication and redundancy elimination during placement for cycle time optimization. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (San Jose, Calif.). 25–30.

SENTOVICH, E., SINGH, K., LAVAGNO, L., MOON, C., MURGAI, R., SALDANHA, A., SAVOJ, STEPHAN, P., BRAYTON, R., AND SANGIOVANNI-VINCENTELLI, A. 1992. SIS: A system for sequential circuit synthesis. Electronics Research Laboratory. Memorandum No. UCB/ERL M92/41.

SRIVASTAVA, A., KASTNER, R., AND SARRAFZADEH, M. 2000. Timing driven gate duplication: Complexity issues and algorithms. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (San Jose, Calif.). 447–450.

THE FPGA PLACE-AND-ROUTE CHALLENGE. http://www.eecg.toronto.edu/~vaughn/challenge/challenge.html.

STRATIX DEVICE HANDBOOK. http://www.altera.com/literature/lit-stx.jsp.

SIMULTANEOUS PLACEMENT WITH CLUSTERING AND DUPLICATION FOR LUT-BASED FPGA DESIGNS. http://ballade.cs.ucla.edu/~chg/spcd.

Received April 2005; revised December 2005; accepted March 2006
