603 Paper

7/29/2019 603 Paper


FPGA Placement Methodologies: A Survey

Xiaoyu Shi

Department of Computing Science,

University of Alberta, Edmonton, Canada

[email protected]

Abstract

The Field Programmable Gate Array (FPGA), a programmable integrated circuit, has gained great popularity in circuit design since its introduction in 1984. Placement in an FPGA decides the physical locations and interconnections of the logic blocks in the circuit design, and it has become the bottleneck of circuit performance. In this survey, we review the classic placement methods used over the past two decades along with some modern placement techniques from the last 5-10 years. In particular, we focus on four categories of placement methods: simulated annealing, min-cut, quadratic, and parallel approaches. The methodology of each algorithm is presented, with an emphasis on comparing performance and evaluating advantages and disadvantages.

1 Introduction

The use of Field Programmable Gate Arrays (FPGAs) to implement integrated circuits has increased dramatically in recent years. The prime advantages of FPGAs are their fast manufacturing turnaround time, low start-up costs, and ease of design, which together involve less financial risk [8]. However, new challenges have emerged since FPGA sizes reached the million-gate level. Design and development with FPGAs suffer from long placement times, and turnaround time is crucial [4]. The placement problem has become the bottleneck of circuit performance in FPGAs. For the next generation of Computer Aided Design (CAD) tools for FPGAs, fast, high-quality placement methods are critical.

In this survey, we use the island-style FPGA model [6]. An island-style FPGA consists of four main parts. Configurable Logic Blocks (CLBs) are the basic logic blocks and implement the logic functions of the circuit. Input/Output Blocks (IOBs) connect the FPGA to external devices. Connection blocks connect a CLB to the routing channels, while switch blocks connect the routing channels to one another [6].

In the placement step, the logic blocks of the netlist are placed onto the FPGA [6]. The goal of placement is to assign each block a location such that an objective function is minimized. There are three common optimization criteria for placement: time-driven, wire-length-driven, and path-driven. Time-driven placement attempts to minimize the delay in the circuit, while wire-length-driven placement aims to minimize the total wire used. Path-driven placement focuses on the logic blocks on the critical path of the circuit, placing them so that both timing and wire length can be optimized.

In the following sections, the paper reviews four categories of FPGA placement methods, compares their experimental results, and analyzes their performance.

2 Simulated Annealing Placement

Simulated annealing placement mimics the annealing process in which molten metal is cooled gradually to produce high-quality metal objects. An initial placement is created by placing the logic blocks randomly in the circuit. A large number of block swaps are then made to gradually reduce the cost. In this section, the well-known Versatile Place and Route (VPR) tool, which uses simulated annealing, is reviewed [9].

2.1. Overview of Simulated Annealing

An annealing process lets molecules cool down in a temperature-controlled manner so that they settle into their best fit in the system. The simulated annealing algorithm is based on random movements of logic blocks, each of which is called a move [6]. A cost function is defined to evaluate the quality of a placement, and the following linear congestion cost function provides the best results in a reasonable computation time [9].

$$\text{Cost} = \sum_{n=1}^{N_{nets}} q(n) \left[ \frac{bb_x(n)}{C_{av,x}(n)} + \frac{bb_y(n)}{C_{av,y}(n)} \right]$$

The cost is a sum over the bounding boxes of all nets in the circuit, where $bb_x(n)$ and $bb_y(n)$ are the horizontal and vertical spans of the bounding box of net $n$. For the blocks $(x_i, y_i)$ of one net, the bounding box is defined by the coordinates $(x_{min}, x_{max}, y_{min}, y_{max})$. The bounding box of one net is illustrated in Figure 1, where the dashed line is the bounding box of a net consisting of four blocks.

Figure 1. Bounding Box of One Net.

$C_{av,x}(n)$ and $C_{av,y}(n)$ are the average channel capacities in the x and y directions over the bounding box of net $n$ [6]. In addition, a compensation factor $q(n)$ corrects for the wire length underestimation introduced by the bounding box method.
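
As a minimal sketch of how this cost could be evaluated, the following Python fragment computes the bounding box of each net and sums the weighted spans. The function names, the list-based data layout, the default values of 1.0, and the choice of span as maximum minus minimum are assumptions made for illustration; they are not taken from VPR.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int]


def bounding_box(blocks: List[Coord]) -> Tuple[int, int, int, int]:
    """Return (xmin, xmax, ymin, ymax) over the blocks of one net."""
    xs = [x for x, _ in blocks]
    ys = [y for _, y in blocks]
    return min(xs), max(xs), min(ys), max(ys)


def placement_cost(
    nets: List[List[Coord]],
    q: Optional[List[float]] = None,      # compensation factor per net
    cav_x: Optional[List[float]] = None,  # average channel capacity in x per net
    cav_y: Optional[List[float]] = None,  # average channel capacity in y per net
) -> float:
    """Linear congestion cost: sum of q(n) * (bb_x/C_av,x + bb_y/C_av,y)."""
    n = len(nets)
    q = q or [1.0] * n
    cav_x = cav_x or [1.0] * n
    cav_y = cav_y or [1.0] * n
    total = 0.0
    for i, blocks in enumerate(nets):
        xmin, xmax, ymin, ymax = bounding_box(blocks)
        bb_x = xmax - xmin  # assumed span definition (illustrative)
        bb_y = ymax - ymin
        total += q[i] * (bb_x / cav_x[i] + bb_y / cav_y[i])
    return total
```

With all factors left at 1.0 the cost reduces to the half-perimeter of each bounding box, which makes the effect of the capacity terms easy to see when they are varied.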

Simulated annealing starts with a random placement of each logic block in the circuit. After the initial placement, a certain number of moves are attempted at each temperature, and each move is evaluated to see whether it reduces the cost. If the cost decreases, the move is always accepted. If the cost increases, there is still a probability that the move is accepted, given by $e^{-\Delta C/T}$, where $\Delta C$ is the change in the cost function that the move causes and $T$ is the temperature. This hill-climbing ability allows simulated annealing to escape local minima and thus to approach a global optimum [9].
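
The acceptance rule described above can be sketched in a few lines of Python. The function name and the explicit random generator argument are assumptions made for illustration, and the handling of a zero temperature is a choice of this sketch rather than something stated in [9].

```python
import math
import random


def accept_move(delta_cost: float, temperature: float, rng: random.Random) -> bool:
    """Decide whether to accept a move that changes the cost by delta_cost."""
    if delta_cost <= 0:
        return True   # moves that do not increase the cost are always accepted
    if temperature <= 0:
        return False  # assumption of this sketch: no uphill moves at T = 0
    # Uphill moves are accepted with probability exp(-delta_cost / T).
    return rng.random() < math.exp(-delta_cost / temperature)
```

At high temperature almost every uphill move passes the test, while at low temperature the placer behaves almost greedily.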

A good annealing schedule is essential to the final results. To increase the amount of time spent at temperatures where a significant fraction of moves is being accepted, VPR uses the following temperature update schedule:

$$T_{new} = \alpha \cdot T_{old}$$

where $\alpha$ is defined as shown in Table 1. Note that $R_{accept}^{old}$ is the fraction of attempted moves that were accepted at the old temperature.
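
The schedule of Table 1 translates directly into a small lookup function. The following Python sketch uses the thresholds and factors from Table 1; the function names are hypothetical.

```python
def alpha(r_accept: float) -> float:
    """Cooling factor from Table 1, given the acceptance rate at the old temperature."""
    if r_accept > 0.96:
        return 0.5
    if r_accept > 0.8:
        return 0.9
    if r_accept > 0.15:
        return 0.95
    return 0.8


def update_temperature(t_old: float, r_accept: float) -> float:
    """Apply T_new = alpha * T_old."""
    return alpha(r_accept) * t_old
```

The temperature drops quickly when nearly every move is accepted or nearly every move is rejected, and slowly in the range between, where most of the useful optimization takes place.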

Even with a good annealing schedule, millions of block swaps are evaluated at each temperature. The most time-consuming and computationally intensive part is calculating the change in cost caused by a swap, so it is crucial to make this part as fast as possible. VPR uses heuristics to speed up this process, such as incremental net bounding box updates and a range limit on move distance that changes as the anneal progresses [6].

2.2. Pros and Cons

| Raccept | α |
|---|---|
| Raccept > 0.96 | 0.5 |
| 0.8 < Raccept ≤ 0.96 | 0.9 |
| 0.15 < Raccept ≤ 0.8 | 0.95 |
| Raccept ≤ 0.15 | 0.8 |

Table 1. Temperature Update Schedule

There are several advantages of the simulated annealing placer. First, it outperforms the other placers wherever direct comparisons can be made [9]. The FPGA CAD tool VPR, which uses simulated annealing, has become the state-of-the-art tool in this field. Second, a simulated annealing placer has an open cost function, which can be defined as wire-length-driven, time-driven, or path-driven. The cost function can also be a linear combination of these types, though it is hard to decide the weights. Third, simulated annealing can reach the global minimum because of its hill-climbing ability. However, simulated annealing is very slow because evaluating each move is computationally expensive and time consuming. In addition, because simulated annealing is inherently sequential, it is very hard to parallelize on multi-core CPUs or clusters.

3 Quadratic Placement

Quadratic placement uses the squared wire length as the objective function and minimizes the cost by solving systems of linear equations [12]. Although quadratic placement considers only the squared wire length, it can finish the placement process efficiently with almost no loss in quality. As a result, it is widely used in VLSI placement [12].

3.1. Overview of Quadratic Placement

The input to a quadratic placer is a hypergraph netlist, and the process tries to minimize the total weighted squared distance between every two nodes. The cost is computed according to the formula:

$$\Phi(x, y) = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} W_{i,j} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]$$

Here x and y are the coordinates of the logic blocks in the netlist, and $W_{i,j}$ is the weight between node $(x_i, y_i)$ and node $(x_j, y_j)$. Because the input is a hypergraph, in which one net can connect more than two nodes and two nodes can be connected by more than one net, there are two models for converting the hypergraph into a graph [7].

The objective function can be rewritten in matrix notation:

$$\Phi(x, y) = \frac{1}{2} x^T Q x + d_x^T x + \frac{1}{2} y^T Q y + d_y^T y + \text{const}$$

where $Q$ is an $n \times n$ symmetric matrix and $d_x$, $d_y$ are $n$-dimensional vectors.

Because the x and y terms have the same form and do not interact, the objective function can be minimized separately in the x dimension and the y dimension. With only one dimension considered, the function becomes:

$$\Phi(x) = \frac{1}{2} x^T Q x + d_x^T x + \text{const}$$

To find the minimum, set the gradient to zero, $\nabla \Phi(x) = 0$, which results in the following matrix equation:

$$Q x + d_x = 0$$

This linear system yields the minimum of the total squared wire length, and it can be solved using non-stationary iterative methods.
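
To make the linear system concrete, the following Python sketch solves $Qx + d_x = 0$ for a two-block example with the conjugate gradient method, which is one example of a non-stationary iterative method; the survey does not state which solver QPF uses. The example netlist (two movable blocks chained between fixed pads at x = 0 and x = 3, all connection weights 1, each connection counted once) and the function name are assumptions made for illustration. The matrix is assumed to be symmetric positive definite, which holds when the movable blocks are connected to at least one fixed pad.

```python
from typing import List


def conjugate_gradient(Q: List[List[float]], d: List[float],
                       tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Solve Q x + d = 0 for symmetric positive definite Q."""
    n = len(d)
    x = [0.0] * n
    r = [-di for di in d]            # residual of Q x = -d at the start x = 0
    p = r[:]                         # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old < tol:
            break
        Qp = [sum(Q[i][j] * p[j] for j in range(n)) for i in range(n)]
        step = rs_old / sum(p[i] * Qp[i] for i in range(n))
        x = [x[i] + step * p[i] for i in range(n)]
        r = [r[i] - step * Qp[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x


# Hypothetical chain: pad(0) -- a -- b -- pad(3), every weight equal to 1.
# Minimizing (1/2)[(x_a - 0)^2 + (x_a - x_b)^2 + (x_b - 3)^2] gives:
Q = [[2.0, -1.0],
     [-1.0, 2.0]]
d_x = [0.0, -3.0]
x_opt = conjugate_gradient(Q, d_x)
```

For this chain the system reads $2x_a - x_b = 0$ and $-x_a + 2x_b = 3$, so the blocks settle at evenly spaced positions between the two pads.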

The algorithm proposed in [12] can be divided into three stages.

- In the first stage, a good initial placement is obtained by repeatedly building, modifying, and solving linear equations. This stage continues until no significant improvement can be achieved.

- In the second stage, instead of building and solving linear equations, nodes are moved directly to reduce the total wire length, since stage 1 has already produced a reasonably good initial placement. Stage 2 is much faster than stage 1, so more iterations can be performed to refine the placement further.

- Finally, low-temperature simulated annealing can be used to refine the placement further.

3.2. Pros and Cons

The main advantage of the quadratic placement technique is that it significantly improves run time with almost no loss in quality compared to VPR. According to the results in [12], across the 20 MCNC benchmark circuits [5], QPF, the quadratic placer of [12], runs 5.8 times faster than VPR on average, while the wire length obtained by QPF is only 1.9% greater than that of VPR. A better linear algebra method for solving the linear equations might reduce the run time further. However, since the squared wire length is the only factor considered in the objective function, quadratic placement cannot account for timing.

4 Min-Cut Placement

Partitioning-based placement algorithms are fast, and hence scalable, for large Application Specific Integrated Circuit (ASIC) placement, and they have also been applied to FPGAs. One recent partitioning-based placement method, named min-cut placement, recursively applies bi-partitioning to map the netlist of a circuit onto the FPGA layout region. It minimizes the number of nets that are cut while keeping highly connected logic blocks in the same partition [3].

4.1. Overview of Min-Cut Placement

Delay optimization is very important in circuit design. Effective delay minimization on large circuits is possible only by accounting for performance as early as possible in the design flow. Min-cut placement targets delay minimization at the placement stage, which is an early step in the design process.

The min-cut placer employs the fundamental divide-and-conquer method. A circuit is recursively bi-partitioned in a breadth-first manner, as shown in Figure 2.

Figure 2. Bi-partitioning Process of Min-Cut.

The cut direction (horizontal or vertical) is decided based on the criticality of the nets crossing the four borders, so that the total number of cuts is minimized [3]. The recursion is repeated until each partition contains only a few blocks, which groups highly connected blocks together and decreases the placement cost. The goal of min-cut is to find a partition that cuts the fewest nets.
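
The cut objective can be illustrated with a short Python sketch. It counts the nets that have blocks on both sides of a bipartition and, for a tiny netlist, finds the best balanced bipartition by exhaustive search. The exhaustive search is only a stand-in chosen for clarity; it is not the multilevel partitioning used by the placer in [3], and it ignores timing criticality. All names and the example netlist are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Set, Tuple


def cut_size(nets: List[Sequence[str]], part_a: Iterable[str]) -> int:
    """Number of nets with at least one block on each side of the cut."""
    a = set(part_a)
    return sum(
        1
        for net in nets
        if any(b in a for b in net) and any(b not in a for b in net)
    )


def best_balanced_bipartition(blocks: List[str],
                              nets: List[Sequence[str]]) -> Tuple[Set[str], int]:
    """Try every half-sized subset; only viable for very small examples."""
    half = len(blocks) // 2
    best = min(combinations(blocks, half), key=lambda side: cut_size(nets, side))
    return set(best), cut_size(nets, best)


# Hypothetical netlist: a and b are tightly connected, as are c and d.
blocks = ["a", "b", "c", "d"]
nets = [("a", "b"), ("a", "b"), ("c", "d"), ("b", "c")]
side, cut = best_balanced_bipartition(blocks, nets)
```

In this example the search keeps the two tightly connected pairs together, so only the single net between b and c is cut.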

All the edges in the netlist are weighted according to timing criticality as well as the terminal alignment of critical nets [3]. The algorithm can be divided into three stages. In the first stage, min-cut uses the state-of-the-art multilevel partitioner hMetis [11] as its partitioning engine. During partitioning, a tight connection is maintained between the circuit graph and the placement, which represents the coordinates of all blocks on the FPGA fabric. Recursive partitioning continues until each leaf partition has only a few blocks. In some cases a leaf partition may contain more blocks than it can accommodate, so overlaps must be removed. In the second stage, overlaps are removed with a greedy technique that moves blocks to the closest best-aligned partition. Finally, the placement is refined with low-temperature simulated annealing to further minimize the delay.

4.2. Pros and Cons

The advantage of the min-cut placement technique is that it minimizes delay at the placement stage, which lays the foundation for a better-performing circuit. In addition, the run times reported in [3] show an average speedup of 3-4x over VPR on the 20 MCNC benchmarks, with a slight degradation in quality. However, the results of min-cut depend on how well the partitioning is performed, and current research is focused on finding heuristics that partition the circuit better. Also, the min-cut placer may not reach the global minimum because of the greedy strategies it uses.

5 Parallel Placement

As modern FPGAs have reached millions of logic blocks, more efficient and scalable FPGA placement algorithms are needed. Parallelization is an appealing way to provide fast placement, given the rapid development of multi-core CPUs in recent years. The parallel approaches reviewed here are based on simulated annealing, since it outperforms the other methods and its main drawback is the time consumed evaluating moves. We divide modern simulated-annealing-based parallel FPGA placers into three categories: the parallel move approach, the area based approach, and the deterministic parallel approach.

5.1. Parallel Move Approach

Since a large number of moves is made at each temperature, the parallel move approach tries to accelerate simulated annealing by performing several moves at the same time. Each move has three possible outcomes: (i) two blocks are swapped, (ii) a block is moved to an empty location, or (iii) the move is rejected. Moves can be done in parallel only if they do not move the same block or move blocks to the same location. Figure 3 shows a simple example of parallel moves. Move 1 and Move 2 can be done in parallel since they are completely independent, while Move 2 and Move 3 cannot, because both try to move block 3, to different locations, at the same time.
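
A minimal Python sketch of this independence test follows. It assumes that a move is represented as a pair of grid locations (source, target), so that two moves collide whenever they touch a common location; this is slightly stricter than the condition stated above, since it also treats a swap partner as occupied. The representation and the function names are hypothetical.

```python
from typing import List, Tuple

Loc = Tuple[int, int]
Move = Tuple[Loc, Loc]  # (source location, target location)


def moves_collide(m1: Move, m2: Move) -> bool:
    """Two moves collide if they share a source or target location."""
    return bool(set(m1) & set(m2))


def independent_batch(moves: List[Move]) -> List[Move]:
    """Greedily keep the moves that touch no location used by an earlier kept move."""
    batch: List[Move] = []
    used = set()
    for move in moves:
        locs = set(move)
        if not (locs & used):
            batch.append(move)
            used |= locs
    return batch
```

A batch selected this way is free of move collisions, but it may still contain net cost collisions, because two moves at different locations can affect the same net.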

However, these conditions only guarantee that there are no move collisions; net cost collisions can still happen. As shown in Figure 4, block 1 and block 3 belong to the same net. When move 1 and move 2 are done in parallel, the bounding box that move 1 computes is that of blocks 2 and 3, while the bounding box that move 2 computes is that of blocks 1 and 4. Two moves that move blocks of the same net may evaluate the bounding box incorrectly, because neither move can take into account that the other move is changing the bounding box.
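
A net cost collision can be detected by checking whether the blocks touched by two moves share any net. The following Python sketch assumes a dictionary that maps each block to the set of nets it belongs to; the dictionary, the block and net names, and the function name are hypothetical.

```python
from typing import Dict, Iterable, Set


def share_net(blocks_1: Iterable[str], blocks_2: Iterable[str],
              nets_of: Dict[str, Set[str]]) -> bool:
    """True if any block of one move is on the same net as a block of the other."""
    nets_1 = set().union(*(nets_of[b] for b in blocks_1))
    nets_2 = set().union(*(nets_of[b] for b in blocks_2))
    return bool(nets_1 & nets_2)


# Hypothetical netlist in which block 1 and block 3 belong to the same net.
nets_of = {
    "block1": {"net_a"},
    "block2": {"net_b"},
    "block3": {"net_a"},
    "block4": {"net_c"},
}
# Move 1 swaps blocks 1 and 2; move 2 swaps blocks 3 and 4.
collision = share_net(["block1", "block2"], ["block3", "block4"], nets_of)
```

Here the two moves touch different blocks and locations, yet they collide on the shared net, so neither can compute the bounding box of that net reliably on its own.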

Figure 3. Parallel Moves Approach.

Generally, there are two ways to deal with move collisions and net cost collisions.

- Ignoring the errors in the cost function is the easiest way to deal with collisions. However, it reduces the accuracy of the cost, which interferes with the acceptance of moves and adversely affects the results.

- Finding disjoint moves that not only move different blocks but also belong to different nets avoids the errors. However, such over-restricted moves result in a smaller swap space, and the synchronization overhead tends to overwhelm the gain from parallelism.

Figure 4. Net Cost Collision.

Both methods show negative speedups [1], because the overhead of synchronization outweighs the advantages of parallelization. Nevertheless, the idea of parallelizing moves has inspired many other parallel FPGA placement methods.

5.2. Area Based Approach

The area based approach is motivated by the collisions that arise in the parallel move approach. It partitions the area of the FPGA and assigns the partitioned areas to different processors. As shown in Figure 5, the whole circuit is partitioned into four parts, and each processor is in charge of one partition.

Figure 5. Collision in Area Partitioning.

The moves evaluated are much less restricted than in the parallel move approach. However, collisions can still happen, because multiple processors may move blocks that belong to the same net when the net crosses partition boundaries, as presented in Figure 5. For example, the bounding box of blocks 1, 2 and 3 cannot be computed correctly, since the blocks belong to different partitions. These errors can be tolerated because nets that span two or more partitions are not expected to occur very often. Moreover, as the temperature cools, swaps tend to happen between nearby blocks.

Since each processor can only move blocks within its own partitioned area, the partitioning must be performed carefully so that each block has the freedom to move to any location in the FPGA; otherwise the placement could not reach the global minimum. The area based approach uses both horizontal and vertical partitions to ensure that the global minimum can be reached [1].
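
One possible way to combine horizontal and vertical partitions is to alternate between vertical strips and horizontal strips in successive phases. The Python sketch below illustrates that idea only; the strip layout, the alternation, and the function names are assumptions of this sketch and are not the exact scheme of [1].

```python
from typing import Tuple

Loc = Tuple[int, int]


def region_owner(x: int, y: int, width: int, height: int,
                 n_proc: int, vertical: bool) -> int:
    """Return the index of the processor that owns grid location (x, y)."""
    if vertical:
        return x * n_proc // width   # vertical strips: split along the x axis
    return y * n_proc // height      # horizontal strips: split along the y axis


def can_move(src: Loc, dst: Loc, width: int, height: int,
             n_proc: int, vertical: bool) -> bool:
    """A processor may only move a block inside its own region."""
    return (region_owner(*src, width, height, n_proc, vertical)
            == region_owner(*dst, width, height, n_proc, vertical))
```

On an 8 x 8 grid with four processors, a block at (1, 6) cannot reach (7, 6) during a vertical-strip phase, but it can during a horizontal-strip phase, so over alternating phases every location becomes reachable.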

The experimental results show that a nonlinear speedup is gained over the sequential placer and that the cost does not degrade as the number of processors increases. This is due to the lower synchronization requirements.

5.3. Deterministic Parallel Approach

One of the constraints on parallelism is the non-determinism of the results. This constraint has seldom been studied in past work (an exception is [2]), but it is vital in a commercial context for the following two reasons [10].

- When a user of a commercial FPGA placement tool reports a bug, the problem must be reproducible. Non-determinism makes this extremely difficult because the results are different for each run.

- In the release testing stage of building a placer, it would be very difficult to investigate failing tests if the results changed randomly.

The algorithm proposed in [10] parallelizes the placement while keeping the results deterministic. The deterministic parallel approach splits a move into two stages: processing and finalization. As shown in Figure 6, during the processing stage each processor proposes a move and evaluates it. This takes the vast majority of the time and thus occurs in parallel. To avoid collisions and maintain determinism, the evaluated moves are put in a queue, and a dependency check ensures that there is no collision; moves that have collided are re-proposed. Note that finalization can be done by any idle processor. In the example, C0 is idle when all the moves in the queue have been checked, so C0 does the finalization.
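
The two-stage structure can be sketched in Python as follows. The evaluation runs in a thread pool, and the finalization walks the queue in a fixed order, so the outcome does not depend on which worker finishes first. This is a simplified illustration under stated assumptions, not the algorithm of [10]: a move is a pair of locations, the hypothetical evaluate function accepts every move, and the dependency check only tests for shared locations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Loc = Tuple[int, int]
Move = Tuple[Loc, Loc]  # (source location, target location)


def evaluate(move: Move) -> Tuple[Move, bool]:
    """Stand-in for the expensive cost evaluation; accepts every move here."""
    return move, True


def finalize(evaluated: List[Tuple[Move, bool]]) -> Tuple[List[Move], List[Move]]:
    """Commit moves in queue order; collided moves are returned for re-proposal."""
    touched = set()
    committed: List[Move] = []
    retry: List[Move] = []
    for move, accepted in evaluated:
        if not accepted:
            continue
        locs = set(move)
        if locs & touched:
            retry.append(move)      # depends on a move committed earlier
        else:
            touched |= locs
            committed.append(move)
    return committed, retry


def process_batch(moves: List[Move], workers: int = 4) -> Tuple[List[Move], List[Move]]:
    """Processing stage in parallel, finalization stage in a fixed serial order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        evaluated = list(pool.map(evaluate, moves))  # results keep the input order
    return finalize(evaluated)
```

Because the queue order is fixed before finalization, running the same batch with one worker or with several workers commits the same moves, which is the property that makes bug reports and regression tests reproducible.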

Figure 6. Deterministic Parallel Approach.

The deterministic parallel approach has several advantages. First, the speedup can be linear, assuming that the finalization time is negligible. Second, a move is processed entirely by one processor, which improves memory locality. Third, the results are deterministic and equivalent to those of a serial run.

6 Future Work

Algorithms for FPGA placement play a vital role in modern integrated circuit design. As the comparisons in this paper show, placement is still a bottleneck, even though a tradeoff can be made between quality and efficiency. Parallel methods still have the potential to improve both run time and quality [10].

Ideally, we would compare the results of each placement algorithm systematically. However, direct comparisons of these algorithms are difficult, partly because of limited access to the algorithms and partly because of their different assumptions. Comparisons based on wire-length-driven placement have been attempted in [5]. More work is needed to build a common framework for directly comparing the performance of different FPGA placers.

7 Conclusion

In this paper, a number of different FPGA placement algorithms are reviewed, and a summary of the comparison is shown in Figure 7.

| Method | Placement Quality | Placement Efficiency |
|---|---|---|
| Simulated Annealing | Overall, it gives the best results, particularly when using wire length as the cost function. The final placement can reach global optimization. | It is very slow because of the computationally expensive evaluation of each move. |
| Quadratic Placement | On average, it requires 1.9% more wire length than simulated annealing. It considers only wire length in the cost function; timing cannot be represented. | Compared to simulated annealing, it can be 5.8 times faster on average. |
| Min-Cut Placement | The final result varies dramatically based on how the partition is made. To the best of our knowledge, no direct comparison of results to simulated annealing has been made. | An average speedup of 3-4 times can be gained compared to simulated annealing. |
| Parallel Placement | In general, with deterministic parallel placement, the results are as good as those of simulated annealing. | At best, the speedup is linear in the number of processors if the finalization time is negligible. |

Figure 7. Overall Comparison of the Four Placement Methods

In general, simulated annealing placement outperforms the other placers with regard to the final results. It can also overcome local minima and has an open cost function. However, it is too time consuming. Quadratic placement gives fast run times, but its results cannot reach global optimization and timing cannot be represented in its cost function. The min-cut method also gives a run time speedup, but the quality of the placement cannot be guaranteed. Parallel algorithms can give good speedups with almost no loss in quality, but their scalability is restricted by memory overhead.

References

[1] A.Nayak A.Choudhary, M.Haldar and P.Banerjee. Parallel algorithms for FPGA placement. Proc. of the 10th Great Lakes Symposium on VLSI, pages 86-94, 2000.

[2] J.Chandy S.Kim B.Rankumar, S.Parkers and P.Banerjee. An evaluation of parallel simulated annealing strategies with application to standard cell placement. TCAD, 16:398-410, 1997.

[3] P.Maidee C.Ababei and K.Bazargan. Time-driven partitioning-based placement for island style FPGAs. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 24:395-406, 2005.

[4] J.Cong D.Chen and P.Pan. FPGA design automation: A survey. Foundations and Trends in Electronic Design Automation, 1, 2006.

[5] S.Yildiz M.Markov I. Villarrubia, P.Parakh and Madden. Benchmarking for large scale placement and beyond. Proc. of the International Symposium on Physical Design, pages 95-103, 2003.

[6] V.Betz J.Rose and A.Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic, 1999.

[7] N.Viswanathan and C.Chu. FastPlace: Efficient analytical placement using cell shifting, iterative local refinement and a hybrid net model. Proc. of ISPD, 2004.

[8] S.Brown and J.Rose. FPGA and CPLD architectures: A tutorial. IEEE Design and Test of Computers, 12:42-57, 1996.

[9] V.Betz and J.Rose. VPR: A new packing, placement and routing tool for FPGA research. International Workshop on Field Programmable Logic and Applications, pages 213-222, 1997.

[10] A.Ludwin V.Betz and K.Padalia. High-quality, deterministic parallel placement for FPGAs on commodity hardware. ACM/SIGDA Int. Symp. on FPGAs, pages 14-23, 2008.

[11] R.Aggarwal V.Kumar, G.Karypis and S.Shekhar. Multilevel hypergraph partitioning: Application in VLSI domain. Proc. ACM/IEEE DAC, 1997.

[12] Y.Xu and M.A.S. Khalid. QPF: Efficient quadratic placement for FPGAs. International Conference on Field Programmable Logic and Application, pages 555-558, 2005.
