7/29/2019 603 Paper
FPGA Placement Methodologies: A Survey
Xiaoyu Shi
Department of Computing Science,
University of Alberta, Edmonton, Canada
Abstract
The Field Programmable Gate Array (FPGA), a programmable integrated circuit, has gained great popularity in circuit design since its introduction in 1984. Placement in an FPGA decides the physical locations and interconnections of the logic blocks in a circuit design, and has become the bottleneck of circuit performance. In this survey paper, we review the classic placement methods that have been used over the past two decades, along with some modern placement techniques from the last 5-10 years. In particular, we focus on four categories of placement methods: simulated annealing, min-cut, quadratic and parallel approaches. The methodology of each algorithm is presented, with an emphasis on comparing performance and evaluating advantages and disadvantages.
1 Introduction
The use of Field Programmable Gate Arrays (FPGAs) to implement integrated circuits has increased dramatically in recent years. The prime advantages of FPGAs are their fast manufacturing turnaround time, low start-up costs and ease of design, which involve less financial risk [8]. However, new challenges have emerged since the size of FPGAs reached the million-gate level. Design and development with FPGAs suffer from long placement times, even though turnaround time is crucial [4]. The placement problem has become the bottleneck of circuit performance in FPGAs. For the next generation of Computer Aided Design (CAD) tools for FPGAs, fast and high-quality placement methods are critical.
In this survey, we use the island-style FPGA model [6]. The generic structure of an island-style FPGA consists of four main parts. Configurable Logic Blocks (CLBs) are the basic logic blocks and implement the logic functions of the circuit. Input/Output Blocks (IOBs) connect the FPGA to external devices. Connection blocks connect a CLB to the routing channels, while switch blocks connect the routing channels to one another [6].
In the placement step, the netlist of logic blocks is placed onto the FPGA [6]. The goal of placement is to assign each block a location so that an objective function is minimized. There are three common optimization criteria for placement: time-driven, wire-length-driven and path-driven. Time-driven placement attempts to minimize the delay in the circuit, while wire-length-driven placement aims to minimize the total wire length used. Path-driven placement focuses on the logic blocks that lie on the critical path of the circuit, so that both timing and wire length can be optimized.
In the following sections, the paper reviews four different categories of FPGA placement methods, compares their experimental results and analyzes their performance.
2 Simulated Annealing Placement
Simulated annealing placement mimics the annealing process in which molten metal is cooled gradually to produce high-quality metal objects. An initial placement is created by placing the logic blocks of the circuit at random. A large number of block swaps are then made to gradually reduce the cost. In this section, the well-known Versatile Place and Route (VPR) tool, which uses simulated annealing, is reviewed [9].
2.1. Overview of Simulated Annealing
An annealing process lets molecules cool down in a controlled manner, governed by temperature, so that they settle into their best fit in the system. The simulated annealing algorithm is based on random movements of logic blocks, each of which is called a move [6]. A cost function is defined to evaluate the quality of a placement; the following linear congestion cost function provides the best results in a reasonable computation time [9].
$$\mathrm{Cost} = \sum_{n=1}^{N_{nets}} q(n) \left[ \frac{bb_x(n)}{C_{av,x}(n)} + \frac{bb_y(n)}{C_{av,y}(n)} \right]$$
The cost is a summation over the bounding boxes of all nets in the circuit. For the blocks $(x_i, y_i)$ in one net, the bounding box is defined by the coordinates $(x_{min}, x_{max}, y_{min}, y_{max})$, and $bb_x(n)$ and $bb_y(n)$ are its horizontal and vertical spans. A bounding box of one net is illustrated in Figure 1, where the dashed line is the bounding box of a net that consists of four blocks.

Figure 1. Bounding Box of One Net.

$C_{av,x}(n)$ and $C_{av,y}(n)$ are the average channel capacities in the x and y directions over the bounding box of net $n$ [6]. The compensation factor $q(n)$ corrects for the wire length underestimation introduced by the bounding box method.
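The cost function above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not VPR's implementation: the names `bounding_box` and `placement_cost` are hypothetical, the span of a box is taken as `max - min + 1`, the compensation factor `q` defaults to a placeholder that always returns 1, and the average channel capacities are treated as constants rather than per-net values.

```python
from typing import Callable, Dict, List, Tuple

Coord = Tuple[int, int]


def bounding_box(net: List[str], pos: Dict[str, Coord]) -> Tuple[int, int, int, int]:
    """Return (xmin, xmax, ymin, ymax) over all blocks of one net."""
    xs = [pos[block][0] for block in net]
    ys = [pos[block][1] for block in net]
    return min(xs), max(xs), min(ys), max(ys)


def placement_cost(
    nets: List[List[str]],
    pos: Dict[str, Coord],
    q: Callable[[int], float] = lambda terminals: 1.0,  # placeholder compensation factor
    c_av_x: float = 1.0,  # assumed uniform average channel capacity in x
    c_av_y: float = 1.0,  # assumed uniform average channel capacity in y
) -> float:
    """Sum q(n) * (bb_x(n) / C_av,x + bb_y(n) / C_av,y) over all nets."""
    total = 0.0
    for net in nets:
        xmin, xmax, ymin, ymax = bounding_box(net, pos)
        bb_x = xmax - xmin + 1  # assumed definition of the horizontal span
        bb_y = ymax - ymin + 1  # assumed definition of the vertical span
        total += q(len(net)) * (bb_x / c_av_x + bb_y / c_av_y)
    return total


# One net with four blocks, as in the Figure 1 scenario.
positions = {"a": (0, 0), "b": (2, 1), "c": (1, 3), "d": (3, 2)}
example_cost = placement_cost([["a", "b", "c", "d"]], positions)
```

Dividing by a larger channel capacity lowers the cost of a span in that direction, which is how the function steers nets toward regions with more routing resources.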
Simulated annealing starts with a random placement of each logic block in the circuit. After the initial placement, a certain number of moves are evaluated at each temperature to see whether they reduce the cost. If the cost decreases, the move is always accepted. If the cost increases, the move may still be accepted, with probability $e^{-\Delta C/T}$, where $\Delta C$ is the change in the cost function that the move causes and $T$ is the temperature. This hill-climbing ability allows simulated annealing to escape local minima rather than converge to them, and thus to approach the global optimum [9].
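The acceptance rule can be written as a short Python sketch: improving moves are always accepted, and a worsening move is accepted with probability exp(-ΔC/T). The function name `accept_move` and the explicit random generator argument are assumptions made for illustration, and the temperature is assumed to be positive.

```python
import math
import random


def accept_move(delta_cost: float, temperature: float, rng: random.Random) -> bool:
    """Decide whether a proposed move is accepted at the given temperature."""
    if delta_cost <= 0:
        # The move does not increase the cost: always accept.
        return True
    # Uphill move: accept with probability exp(-delta_cost / temperature).
    return rng.random() < math.exp(-delta_cost / temperature)


rng = random.Random(0)
downhill = accept_move(-1.0, 1.0, rng)        # always accepted
frozen_uphill = accept_move(1000.0, 0.001, rng)  # acceptance probability underflows to 0
```

At high temperature the exponential is close to 1, so most uphill moves pass; as the temperature falls, the same cost increase becomes progressively less likely to be accepted.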
A good annealing schedule is essential to the final results. To increase the amount of time spent at temperatures where a significant fraction of moves are being accepted, VPR uses the following temperature update schedule:

$$T_{new} = \alpha \, T_{old}$$

where $\alpha$ is defined as shown in Table 1. Note that $R_{accept}^{old}$ is the fraction of moves that were accepted at the old temperature.
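As a small illustration, the schedule in Table 1 maps directly onto a lookup function. The sketch below assumes that the upper bound of each range in Table 1 is inclusive; the function names `alpha` and `update_temperature` are hypothetical.

```python
def alpha(r_accept: float) -> float:
    """Temperature multiplier from Table 1, given the fraction of accepted moves."""
    if r_accept > 0.96:
        return 0.5   # almost everything is accepted: cool quickly
    if r_accept > 0.8:
        return 0.9
    if r_accept > 0.15:
        return 0.95  # the productive range: cool slowly
    return 0.8       # few moves accepted: finish quickly


def update_temperature(t_old: float, r_accept: float) -> float:
    """Apply T_new = alpha * T_old."""
    return alpha(r_accept) * t_old


t_next = update_temperature(10.0, 0.5)  # 0.15 < 0.5 <= 0.8, so alpha is 0.95
```

The multiplier is closest to 1 in the middle range of acceptance rates, so the anneal spends most of its time there.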
Even with a good annealing schedule, millions of block swaps are evaluated at each temperature. The most time-consuming and computationally intensive part is calculating the change in cost caused by a swap, so it is crucial to make this part as fast as possible. VPR uses several heuristics to speed up this process, such as incremental net bounding box updates and a distance limit on moves whose range changes during the anneal [6].
R_accept                     α
R_accept > 0.96              0.5
0.8 < R_accept ≤ 0.96        0.9
0.15 < R_accept ≤ 0.8        0.95
R_accept ≤ 0.15              0.8

Table 1. Temperature Update Schedule

2.2. Pros and Cons
The simulated annealing placer has several advantages. First, it outperforms the other placers wherever direct comparisons can be made [9]. The FPGA CAD tool VPR, which uses simulated annealing, has become the state-of-the-art tool in this field. Second, the simulated annealing placer has an open cost function, which can be defined as wire-length-driven, time-driven or path-driven. The cost function can also be a linear combination of these types, though it is hard to choose the weights. Third, simulated annealing can reach the global minimum because of its hill-climbing ability. However, simulated annealing is very slow because evaluating each move is computationally expensive and time-consuming. Moreover, due to its inherently sequential nature, simulated annealing is very hard to parallelize on multi-core CPUs or clusters.
3 Quadratic Placement
The quadratic placement method uses the squared wire length as its objective function and minimizes the cost by solving systems of linear equations [12]. Although quadratic placement considers only the squared wire length, it can finish the placement process efficiently with almost no loss of quality. As a result, it is widely used in VLSI placement [12].
3.1. Overview of Quadratic Placement
The input to a quadratic placer is a hypergraph netlist, and the process tries to minimize the total weighted squared distance between every pair of nodes. The cost is computed according to the formula:

$$\Phi(x, y) = \frac{1}{2} \sum_{i=1,j=1}^{m} W_{i,j} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]$$

Here $x$ and $y$ are the coordinates of the logic blocks in the netlist, and $W_{i,j}$ is the weight between nodes $(x_i, y_i)$ and $(x_j, y_j)$. Since the input is a hypergraph, and two nodes can be connected by more than one net, there are two models for converting the hypergraph into a graph [7].
The objective function can be rewritten in matrix notation:

$$\Phi(x, y) = \frac{1}{2} x^T Q x + d_x^T x + \frac{1}{2} y^T Q y + d_y^T y + \mathrm{const}$$

where $Q$ is an $n \times n$ symmetric matrix and $d_x$, $d_y$ are $n$-dimensional vectors.
Because the x and y terms have the same form and do not interact, the objective function can be minimized separately in the x and y dimensions. With only the x dimension considered, the function becomes:

$$\Phi(x) = \frac{1}{2} x^T Q x + d_x^T x + \mathrm{const}$$

To find the minimum, set the gradient $\nabla \Phi(x) = 0$, which results in the following matrix equation:

$$Q x + d_x = 0$$

This linear system, whose solution minimizes the total squared wire length, can be solved using non-stationary iterative methods.
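To make the last step concrete, here is a minimal sketch that solves Q x + d_x = 0 with the conjugate gradient method, which is one example of a non-stationary iterative method; the text does not say which solver [12] actually uses. The example system is an assumption chosen for illustration: two movable blocks connected in a chain between fixed pads at x = 0 and x = 3, with unit weights, which gives Q = [[2, -1], [-1, 2]] and d_x = [0, -3]. The sketch assumes Q is symmetric positive definite.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def solve_placement(q: Matrix, d: Vector, tol: float = 1e-12, max_iter: int = 1000) -> Vector:
    """Solve q x + d = 0 by conjugate gradient (q symmetric positive definite)."""
    x = [0.0] * len(d)
    r = [-di for di in d]  # residual of q x = -d at the starting point x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old <= tol:
            break
        qp = matvec(q, p)
        step = rs_old / dot(p, qp)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * qpi for ri, qpi in zip(r, qp)]
        rs_new = dot(r, r)
        # New direction: current residual plus a multiple of the previous direction.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


Q = [[2.0, -1.0], [-1.0, 2.0]]
d_x = [0.0, -3.0]
x_coords = solve_placement(Q, d_x)  # the blocks spread evenly between the pads
```

For this chain the minimizer places the two blocks at x = 1 and x = 2, evenly spaced between the pads, which is the expected behaviour of a squared wire length objective.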
The algorithm proposed in [12] can be divided into three stages.
- In the first stage, a good initial placement is obtained by repeatedly building, modifying and solving linear equations. This stage continues until no significant improvement can be achieved.
- In stage 2, instead of building and solving linear equations, nodes are moved directly to reduce the total wire length, since stage 1 has already given a reasonably good initial placement. Stage 2 is much faster than stage 1, so more iterations can be performed to refine the placement further.
- Finally, low-temperature simulated annealing can be used to refine the placement further.
3.2. Pros and Cons
The main advantage of the quadratic placement technique is that it significantly improves the run time with almost no loss of quality compared to VPR. According to the results in [12], across the 20 MCNC benchmark circuits [5], the quadratic placer QPF runs 5.8 times faster than VPR on average, while the wire length it obtains is only 1.9% greater than that of VPR. Using a better algebraic method to solve the linear equations might reduce the run time further. However, since the squared wire length is the only factor considered in the objective function, quadratic placement cannot account for timing.
4 Min-Cut Placement
Partitioning-based placement algorithms are fast, and hence scalable, for large Application Specific Integrated Circuit (ASIC) placement, and have also been applied to FPGAs. One recent partitioning-based placement method, named min-cut placement, recursively applies bi-partitioning to map the netlist of a circuit onto the FPGA layout region. It minimizes the number of nets that are cut while keeping highly connected logic blocks in the same partition [3].
4.1. Overview of Min-Cut Placement
Delay optimization is very important in circuit design. Effective delay minimization on large circuits is possible only by accounting for performance as early as possible in the design flow. Min-cut placement targets delay minimization at the placement stage, which is an early step in the design process.
The min-cut placer employs the fundamental divide-and-conquer method. A circuit is recursively bi-partitioned in a breadth-first manner, as shown in Figure 2.
Figure 2. Bi-partitioning Process of Min-Cut.
The cut direction (horizontal or vertical) is decided based on the criticality of the nets crossing the four borders of the region, so that the total number of cuts is minimized [3]. This recursive process is repeated until each partition contains only a few blocks, which groups highly connected blocks together and decreases the placement cost. The goal of min-cut is to find a partition that cuts the fewest wires in the netlist.
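The quantity being minimized at each bi-partitioning step can be illustrated with a toy Python sketch. It counts the nets cut by a partition and finds the best balanced split by exhaustive search. This is an assumption-laden illustration only: the real placer in [3] uses the hMetis partitioner and timing-weighted edges, whereas here every net has weight 1, the helper names are hypothetical, and exhaustive search is practical only for a handful of blocks.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple


def cut_size(nets: Sequence[Sequence[str]], left: Iterable[str]) -> int:
    """Number of nets with at least one block on each side of the partition."""
    left_set = set(left)
    return sum(
        1
        for net in nets
        if any(b in left_set for b in net) and any(b not in left_set for b in net)
    )


def best_bipartition(blocks: List[str], nets: Sequence[Sequence[str]]) -> Tuple[str, ...]:
    """Exhaustively pick the balanced left half that cuts the fewest nets."""
    half = len(blocks) // 2
    return min(combinations(sorted(blocks), half), key=lambda left: cut_size(nets, left))


toy_nets = [("a", "b"), ("c", "d"), ("b", "c")]
left_half = best_bipartition(["a", "b", "c", "d"], toy_nets)
left_cut = cut_size(toy_nets, left_half)
```

In this toy netlist the split that keeps `a` with `b` and `c` with `d` cuts only the net between `b` and `c`, so the tightly connected pairs stay together. Applying the same step recursively to each half gives the breadth-first refinement shown in Figure 2.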
All the edges in the netlist are weighted by timing criticality, and terminal alignment of critical nets is also taken into account [3]. The algorithm can be divided into three stages. In the first stage, min-cut uses the state-of-the-art multilevel partitioner hMetis [11] as its partitioning engine. During the partitioning process, a tight connection is maintained between the circuit graph and the placement, which represents the coordinates of all blocks on the FPGA fabric. Recursive partitioning continues until each leaf partition has only a few blocks. In some cases a leaf partition may contain more blocks than it can accommodate, so overlaps must be removed. In stage two, overlaps are removed using a greedy technique that moves blocks to the closest best-aligned partition. Finally, the placement is refined using low-temperature simulated annealing to further minimize the delay.
4.2. Pros and Cons
The advantage of the min-cut placement technique is that it minimizes delay at the placement stage, which lays the foundation for a better-performing circuit. In addition, the run times reported in [3] show an average 3-4x speedup over VPR on the 20 MCNC benchmarks, with a slight degradation in quality. However, the results of min-cut depend on how well the partitioning is performed, and current research focuses on finding heuristics that partition the circuit better. Also, the min-cut placer may not reach the global minimum because of the greedy strategies it uses.
5 Parallel Placement
As the scale of modern FPGAs has reached millions of logic blocks, more efficient and scalable FPGA placement algorithms are needed. Parallelization is an appealing way to provide fast placement, given the rapid development of multi-core CPUs in recent years. The parallel approaches that we review are based on simulated annealing, since it outperforms the other methods and its main drawback is the time consumed by its moves. We divide modern simulated-annealing-based parallel FPGA placers into three categories: the parallel move approach, the area based approach and the deterministic parallel approach.
5.1. Parallel Move Approach
Since a large number of moves are made at each temperature, the parallel move approach tries to accelerate simulated annealing by performing several moves at the same time. Each move has three possible outcomes: (i) two blocks are swapped, (ii) a block is moved to an empty location, or (iii) the move is rejected. Moves can be done in parallel only if they do not move the same block or move blocks to the same location. Figure 3 shows a simple example of parallel moves. Move 1 and Move 2 can be done in parallel since they are totally independent, while Move 2 and Move 3 cannot, because they try to move block 3 to different locations at the same time.
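The independence condition can be sketched as a simple check on the locations that two moves touch. The representation below is an assumption made for illustration: a move is a (source, destination) pair of grid locations, and two moves are treated as independent only if they share no location at all. This is slightly more conservative than the rule stated above, because it also rejects a move whose destination is another move's source.

```python
from typing import NamedTuple, Tuple

Location = Tuple[int, int]


class Move(NamedTuple):
    src: Location  # location of the block being moved
    dst: Location  # target location (empty, or holding the swap partner)


def moves_independent(a: Move, b: Move) -> bool:
    """True if the two moves touch completely different locations."""
    return {a.src, a.dst}.isdisjoint({b.src, b.dst})


# Mirrors the Figure 3 scenario: moves 2 and 3 both start from the same block.
move1 = Move(src=(0, 0), dst=(1, 1))
move2 = Move(src=(2, 2), dst=(3, 3))
move3 = Move(src=(2, 2), dst=(0, 3))
```

Here `move1` and `move2` can proceed together, while `move2` and `move3` collide because both try to relocate the block at (2, 2).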
However, these conditions only guarantee that there are no move collisions; net cost collisions can still happen. As shown in Figure 4, block 1 and block 3 belong to the same net. When move 1 and move 2 are done in parallel, the resulting bounding box seen by move 1 is the bounding box of blocks 2 and 3, while the one seen by move 2 is the bounding box of blocks 1 and 4. Two moves that move blocks of the same net may therefore evaluate the bounding box incorrectly, because neither move can take into account that the other is changing the bounding box at the same time.
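A net cost collision can be detected by comparing the sets of nets that each move touches. The sketch below is a hypothetical illustration: `block_nets` maps each block to the identifiers of the nets it belongs to, and each move is described simply by the blocks it relocates.

```python
from typing import Dict, Iterable, Set


def nets_touched(moved_blocks: Iterable[int], block_nets: Dict[int, Set[str]]) -> Set[str]:
    """All nets whose bounding box may change when these blocks move."""
    touched: Set[str] = set()
    for block in moved_blocks:
        touched |= block_nets.get(block, set())
    return touched


def net_cost_collision(
    blocks_a: Iterable[int], blocks_b: Iterable[int], block_nets: Dict[int, Set[str]]
) -> bool:
    """True if two moves relocate blocks that share at least one net."""
    return bool(nets_touched(blocks_a, block_nets) & nets_touched(blocks_b, block_nets))


# As in the Figure 4 scenario, blocks 1 and 3 belong to the same net.
membership = {1: {"n0"}, 2: {"n1"}, 3: {"n0"}, 4: {"n2"}}
shared_net = net_cost_collision([1], [3], membership)
separate_nets = net_cost_collision([2], [4], membership)
```

Two moves can pass the location check of the previous paragraph and still fail this one, which is exactly the case where each move evaluates a bounding box that the other is changing.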
Figure 3. Parallel Moves Approach.
Generally, there are two ways to deal with move collisions and net cost collisions.
- Ignore the errors in the cost function. This is the easiest way to deal with collisions, but it reduces the accuracy of the cost, which interferes with the acceptance of moves and adversely affects the results.
- Find disjoint moves that not only move different blocks but also affect different nets. The over-restricted moves result in a smaller swap space, and the synchronization overhead tends to overwhelm the gain from parallelism.
Figure 4. Net Cost Collision.
Both methods show negative speedups (that is, slowdowns) [1], because the overhead of synchronization outweighs the advantages of parallelization. Nevertheless, the idea of parallelizing moves has inspired many other parallel FPGA placement methods.
5.2. Area Based Approach
The area based approach is motivated by the collisions described for the parallel move approach. It partitions the area of the FPGA and assigns the partitioned areas to different processors. As shown in Figure 5, the whole circuit is partitioned into four parts, and each processor is in charge of one partition.

Figure 5. Collision in Area Partitioning.
The moves evaluated are much less restricted than in the parallel move approach. However, collisions can still happen, because multiple processors may move blocks belonging to a net that spans partitions, as presented in Figure 5. For example, the bounding box of blocks 1, 2 and 3 cannot be computed accurately, since they belong to different partitions. These errors can be tolerated because nets that span two or more partitions are not expected to occur very often. Moreover, as the temperature cools, swaps tend to happen between nearby blocks.
Since each processor can only move blocks within its own partitioned area, the partitioning must be performed carefully so that each block retains the freedom to move to any location in the FPGA; otherwise the placement cannot reach the global minimum. The area based approach uses both horizontal and vertical partitions to ensure that the global minimum can be reached [1].
The experimental results show a non-linear speedup over the sequential placer, and the cost does not degrade as the number of processors increases. This is due to the lower synchronization requirements.
5.3. Deterministic Parallel Approach
One of the constraints of parallelism is the non-determinism of the results. This constraint has seldom been studied in past work (an exception is [2]), but it is vital in a commercial context for the following two reasons [10].
- When a user of a commercial FPGA placement tool reports a bug, the problem must be reproducible. Non-determinism makes this extremely difficult because the results are different for each run.
- In the release testing stage of building a placer, it would be very difficult to investigate failing tests if the results changed randomly.
The algorithm proposed in [10] parallelizes the placement while keeping the results deterministic. The deterministic parallel approach splits a move into two stages: processing and finalization. As shown in Figure 6, during the processing stage each processor proposes a move and evaluates it. This takes the vast majority of the time and thus occurs in parallel. To avoid collisions and maintain determinism, the evaluated moves are put in a queue, and a dependency check ensures that there are no collisions; moves that have collided are re-proposed. Note that finalization can be done by any idle processor. In our example, C0 is idle when all the moves in the queue have been checked, so C0 does the finalization.
Figure 6. Deterministic Parallel Approach.
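The two-stage idea can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the algorithm of [10]: the placement is a one-dimensional row of slots, the cost is the sum of net spans, acceptance is greedy rather than annealed, collided moves are discarded instead of re-proposed, and all function names are hypothetical. Determinism comes from two choices: each proposed move gets its own seed, so its outcome does not depend on which thread runs it, and finalization walks the results in a fixed queue order.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def cost(placement: Sequence[str], nets: Sequence[Sequence[str]]) -> int:
    """Sum of net spans (max slot minus min slot) on a 1-D row of slots."""
    slot = {block: s for s, block in enumerate(placement)}
    return sum(max(slot[b] for b in net) - min(slot[b] for b in net) for net in nets)


def propose(job: Tuple[Tuple[str, ...], Sequence[Sequence[str]], int]) -> Tuple[int, int, int]:
    """Processing stage: propose one swap and evaluate it against the snapshot."""
    snapshot, nets, seed = job
    rng = random.Random(seed)  # per-move seed, independent of thread scheduling
    i, j = rng.sample(range(len(snapshot)), 2)
    trial = list(snapshot)
    trial[i], trial[j] = trial[j], trial[i]
    return i, j, cost(trial, nets) - cost(snapshot, nets)


def nets_of(block: str, nets: Sequence[Sequence[str]]) -> set:
    return {n for n, net in enumerate(nets) if block in net}


def place(placement, nets, batches=20, batch_size=4, workers=4, seed=0) -> List[str]:
    placement = list(placement)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for b in range(batches):
            snapshot = tuple(placement)
            jobs = [(snapshot, nets, seed + b * batch_size + k) for k in range(batch_size)]
            results = list(pool.map(propose, jobs))  # map returns results in job order
            # Finalization stage: serial dependency check in queue order.
            used_slots, used_nets = set(), set()
            for i, j, delta in results:
                move_nets = nets_of(snapshot[i], nets) | nets_of(snapshot[j], nets)
                collides = i in used_slots or j in used_slots or bool(move_nets & used_nets)
                if delta < 0 and not collides:
                    placement[i], placement[j] = placement[j], placement[i]
                    used_slots.update((i, j))
                    used_nets |= move_nets
    return placement


start = ["a", "d", "b", "e", "c", "f"]
demo_nets = [("a", "b", "c"), ("d", "e"), ("e", "f")]
serial_result = place(start, demo_nets, workers=1)
parallel_result = place(start, demo_nets, workers=4)
```

Running with one worker or four gives the same final placement, which is the serial equivalence property described in the text. Because committed moves in a batch share no slots and no nets, each evaluated cost change remains valid when the moves are applied together.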
The deterministic parallel approach has several advantages. First, the speedup can be linear, assuming that the finalization time is negligible. Second, a move is now processed entirely by one processor, which improves memory locality. Third, the results are deterministic and serially equivalent.
6 Future Work
Algorithms for FPGA placement play a vital role in modern integrated circuit design. As the comparisons in this paper show, placement is still a bottleneck, even though a tradeoff can be made between quality and efficiency. Parallel methods still have the potential to improve both run time and quality [10].
Ideally, we want to systematically compare the results of
each placement algorithm. However, direct comparisons of
these algorithms are difficult, partly because of the limited
access to the algorithms, and partly due to their different
assumptions. Comparisons based on wire-length placement
have been attempted in [5]. More work is needed to build
a common framework to directly compare the performance
of different FPGA placers.
7 Conclusion
In this paper, a number of different FPGA placement algorithms are reviewed, and a summary of the comparison is shown in Figure 7.

Simulated Annealing
  Placement Quality: Overall, it gives the best results, particularly when using wire length as the cost function. The final placement result can reach global optimization.
  Placement Efficiency: It is very slow because of the computationally expensive evaluation of each move.
Quadratic Placement
  Placement Quality: On average, it requires 1.9% more wire length compared to the simulated annealing method. It only considers the wire length factor in the cost function, while the timing part cannot be shown.
  Placement Efficiency: Compared to simulated annealing, it can be 5.8 times faster on average.
Min-Cut Placement
  Placement Quality: The final result varies dramatically based on how the partition is made. To the best of our knowledge, no direct comparison of results to simulated annealing has been made.
  Placement Efficiency: An average of 3-4 times speedup can be gained compared to simulated annealing.
Parallel Placement
  Placement Quality: In general, using the deterministic parallel placement, the results are as good as those of simulated annealing.
  Placement Efficiency: The best speedup can be linear with the number of processors if the finalization time is negligible.

Figure 7. Overall Comparison of the Four Placement Methods
Simulated annealing placement in general outperforms the other placers with regard to the final results. In addition, it can escape local minima and has an open cost function. However, it is too time-consuming. Quadratic placement gives fast run times, but its results cannot reach global optimization and timing cannot be represented in its cost function. The min-cut method also gives a run time speedup, but the quality of the placement cannot be guaranteed. Parallel algorithms can give good speedups with almost no loss of quality, but their scalability is restricted by memory overhead.
References
[1] A. Nayak, A. Choudhary, M. Haldar and P. Banerjee. Parallel algorithms for FPGA placement. Proc. of the 10th Great Lakes Symposium on VLSI, pages 86-94, 2000.
[2] J. Chandy, S. Kim, B. Rankumar, S. Parkers and P. Banerjee. An evaluation of parallel simulated annealing strategies with application to standard cell placement. TCAD, 16:398-410, 1997.
[3] P. Maidee, C. Ababei and K. Bazargan. Time-driven partitioning-based placement for island style FPGAs. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 24:395-406, 2005.
[4] J. Cong, D. Chen and P. Pan. FPGA design automation: A survey. Foundations and Trends in Electronic Design Automation, 1, 2006.
[5] S. Yildiz, M. Markov, I. Villarrubia, P. Parakh and Madden. Benchmarking for large scale placement and beyond. Proc. of the International Symposium on Physical Design, pages 95-103, 2003.
[6] V. Betz, J. Rose and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic, 1999.
[7] N. Viswanathan and C. Chu. FastPlace: Efficient analytical placement using cell shifting, iterative local refinement and a hybrid net model. Proc. of ISPD, 2004.
[8] S. Brown and J. Rose. FPGA and CPLD architectures: A tutorial. IEEE Design and Test of Computers, 12:42-57, 1996.
[9] V. Betz and J. Rose. VPR: A new packing, placement and routing tool for FPGA research. International Workshop on Field Programmable Logic and Applications, pages 213-222, 1997.
[10] A. Ludwin, V. Betz and K. Padalia. High-quality, deterministic parallel placement for FPGAs on commodity hardware. ACM/SIGDA Int. Symp. on FPGAs, pages 14-23, 2008.
[11] R. Aggarwal, V. Kumar, G. Karypis and S. Shekhar. Multilevel hypergraph partitioning: Application in VLSI domain. Proc. ACM/IEEE DAC, 1997.
[12] Y. Xu and M.A.S. Khalid. QPF: Efficient quadratic placement for FPGAs. International Conference on Field Programmable Logic and Applications, pages 555-558, 2005.