603 Paper

  • 7/29/2019 603 Paper


    FPGA Placement Methodologies: A Survey

    Xiaoyu Shi

    Department of Computing Science,

    University of Alberta, Edmonton, Canada

    [email protected]

    Abstract

    Field Programmable Gate Arrays (FPGAs), a class of programmable integrated circuits, have gained great popularity in circuit design since their introduction in 1984. Placement decides the physical locations and interconnections of the logic blocks in a circuit design, and it has now become the bottleneck of circuit performance. In this survey paper, we review the classic placement methods that have been used over the past two decades, along with some modern placement techniques from the last 5-10 years. In particular, we focus on four categories of placement methods: simulated annealing, min-cut, quadratic and parallel approaches. The methodology of each algorithm is presented, with an emphasis on comparing performance and evaluating advantages and disadvantages.

    1 Introduction

    The use of Field Programmable Gate Arrays (FPGAs) to implement integrated circuits has increased dramatically in recent years. The prime advantages of FPGAs are their fast manufacturing turnaround time, low start-up costs and ease of design, which together involve less financial risk [8]. However, new challenges have emerged since the size of FPGAs reached the million-gate level. Design and development with FPGAs suffer from long placement times, even though turnaround time is crucial [4]. The placement problem has become the bottleneck of circuit performance in FPGAs. For the next generation of Computer Aided Design (CAD) tools for FPGAs, fast and high-quality placement methods are critical.

    In this survey, we use the island-style FPGA model [6]. The generic structure of an island-style FPGA consists of four main parts. Configurable Logic Blocks (CLBs) are the basic logic blocks and implement the logic functions of the circuit. Input/Output Blocks (IOBs) connect the FPGA to external devices. Connection blocks connect a CLB to the routing channels, while switch blocks connect the routing channels to one another [6].

    In the placement step, the logic blocks of the netlist are assigned to physical locations on the FPGA [6]. The goal of placement is to put each block in a location such that an objective function is minimized. There are three common optimization criteria for placement: time-driven, wire-length-driven and path-driven. Time-driven placement attempts to minimize the delay of the circuit, while wire-length-driven placement aims to minimize the total wire length used. Path-driven placement focuses on the logic blocks that lie on the critical paths of the circuit, so that both timing and wire length can be optimized.

    In the following sections, the paper reviews four different categories of FPGA placement methods, compares their experimental results and analyzes their performance.

    2 Simulated Annealing Placement

    Simulated annealing placement mimics the annealing process in which molten metal is cooled gradually to produce high-quality metal objects. An initial placement is created by placing the logic blocks randomly on the FPGA. A large number of block swaps are then made to gradually reduce the cost. In this section, we review the well-known Versatile Place and Route (VPR) tool, which uses simulated annealing [9].

    2.1. Overview of Simulated Annealing

    An annealing process lets molecules cool down in a manner controlled by the temperature so that they settle into their best fit in the system. The simulated annealing algorithm is based on random movements of logic blocks, each of which is called a move [6]. A cost function is defined to evaluate the quality of a placement; the following linear congestion cost function provides the best results in a reasonable computation time [9].

    \[ \mathrm{Cost} = \sum_{n=1}^{N_{nets}} q(n) \left[ \frac{bb_x(n)}{C_{av,x}(n)} + \frac{bb_y(n)}{C_{av,y}(n)} \right] \]

    The cost is a summation over the bounding boxes of all nets in the circuit. For the blocks $(x_i, y_i)$ of one net, the bounding box is defined by the coordinates $(x_{min}, x_{max}, y_{min}, y_{max})$, and $bb_x(n)$ and $bb_y(n)$ are the horizontal and vertical spans of the bounding box of net n. A bounding box of one net is illustrated in Figure 1. Note that the dashed line is the bounding box of a net that consists of four blocks.

    Figure 1. Bounding Box of One Net.

    $C_{av,x}(n)$ and $C_{av,y}(n)$ are the average channel capacities in the x and y directions over the bounding box of net n [6]. In addition, $q(n)$ is a compensation factor that corrects for the wire length underestimation introduced by the bounding box method.
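
    The cost function above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not VPR's implementation: the channel capacities are taken as constants (`cav_x`, `cav_y`), the compensation factor `q` defaults to 1 for every net, the span is taken as max minus min, and the names `bounding_box` and `placement_cost` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, int]


def bounding_box(blocks: List[Point]) -> Tuple[int, int, int, int]:
    """Return (xmin, xmax, ymin, ymax) over the blocks of one net."""
    xs = [x for x, _ in blocks]
    ys = [y for _, y in blocks]
    return min(xs), max(xs), min(ys), max(ys)


def placement_cost(
    nets: Dict[str, List[Point]],
    cav_x: float = 1.0,  # assumption: uniform channel capacity in x
    cav_y: float = 1.0,  # assumption: uniform channel capacity in y
    q: Callable[[int], float] = lambda n_terminals: 1.0,  # assumption: no compensation
) -> float:
    """Sum q(n) * (bb_x(n) / C_av,x + bb_y(n) / C_av,y) over all nets."""
    total = 0.0
    for blocks in nets.values():
        xmin, xmax, ymin, ymax = bounding_box(blocks)
        bb_x = xmax - xmin  # horizontal span of the bounding box
        bb_y = ymax - ymin  # vertical span of the bounding box
        total += q(len(blocks)) * (bb_x / cav_x + bb_y / cav_y)
    return total


# Hypothetical two-net example: a four-block net and a two-block net.
example_nets = {
    "n1": [(0, 0), (3, 1), (1, 4), (2, 2)],
    "n2": [(1, 1), (2, 1)],
}
example_cost = placement_cost(example_nets)
```

    A real placer would supply per-net channel capacities and a `q` that grows with the number of terminals, since the bounding box underestimates the wiring of nets with many terminals.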

    Simulated annealing starts with a random placement of each logic block in the circuit. After the initial placement, a certain number of moves are attempted at each temperature, and each move is evaluated by whether it reduces the cost. If the cost decreases, the move is always accepted. If the cost increases, the move can still be accepted, with probability $e^{-\Delta C / T}$, where $\Delta C$ is the change in the cost function that the move causes and $T$ is the temperature. This hill-climbing ability lets simulated annealing escape local minima and thus move toward a global optimum [9].
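
    The acceptance rule just described is the standard Metropolis criterion. The sketch below is a minimal illustration; the function name `accept_move` and the injectable `rng` argument are assumptions made for testability, and the temperature is assumed to be strictly positive.

```python
import math
import random


def accept_move(delta_cost: float, temperature: float, rng=random) -> bool:
    """Decide whether to accept a move under the Metropolis criterion.

    Moves that do not increase the cost are always accepted; uphill moves
    are accepted with probability exp(-delta_cost / temperature).
    """
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)
```

    At high temperature almost every uphill move passes the test, while near zero temperature the rule degenerates into a greedy search that accepts only improvements.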

    A good annealing schedule is essential to the final results. To increase the amount of time spent at temperatures where a significant fraction of moves are being accepted, VPR uses the following temperature update schedule.

    \[ T_{new} = \alpha \, T_{old} \]

    where $\alpha$ is defined as shown in Table 1. Note that $R_{accept}^{old}$ is the fraction of moves that were accepted at the old temperature.
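
    As a small worked sketch, the schedule of Table 1 maps directly onto a lookup function. The function names are hypothetical, and treating the upper bound of each range as inclusive is an assumption about the table's boundary cases.

```python
def alpha(r_accept: float) -> float:
    """Cooling factor chosen from the fraction of accepted moves (Table 1)."""
    if r_accept > 0.96:
        return 0.5   # nearly every move accepted: cool quickly
    if r_accept > 0.8:
        return 0.9
    if r_accept > 0.15:
        return 0.95  # productive range: cool slowly
    return 0.8       # few moves accepted: cool faster again


def next_temperature(t_old: float, r_accept: float) -> float:
    """Apply T_new = alpha * T_old."""
    return alpha(r_accept) * t_old
```

    The slowest cooling (0.95) falls in the middle band of acceptance rates, which is where the schedule aims to spend most of the annealing time.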

    Even with a good annealing schedule, millions of block swaps are evaluated at each temperature. The most time-consuming and computationally intensive part is calculating the change in cost caused by a swap, so it is crucial to make this part as fast as possible. VPR uses several heuristics to speed up this process, such as incremental updates of net bounding boxes and a range limit on swap distance that changes during the anneal [6].
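
    The idea behind an incremental bounding box update can be sketched as follows. This is a simplified illustration rather than VPR's actual routine: when the moved block was strictly inside the old box, the box can only grow, so it is extended in constant time; when the block sat on the boundary, the box may shrink, and this sketch falls back to a full recomputation. The function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (xmin, xmax, ymin, ymax)


def full_bbox(points: List[Point]) -> Box:
    """Recompute the bounding box from all terminals of a net."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), max(xs), min(ys), max(ys)


def update_bbox(box: Box, old: Point, new: Point, points_after: List[Point]) -> Box:
    """Update a net's bounding box after one terminal moves from old to new."""
    xmin, xmax, ymin, ymax = box
    ox, oy = old
    if ox in (xmin, xmax) or oy in (ymin, ymax):
        # The moved terminal defined part of the boundary, so the box may shrink.
        return full_bbox(points_after)
    nx, ny = new
    # Interior terminal: the old box is still valid and can only be extended.
    return min(xmin, nx), max(xmax, nx), min(ymin, ny), max(ymax, ny)
```

    For nets with many terminals most blocks lie in the interior of the box, so the constant-time branch is taken far more often than the recomputation.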

    R_accept                     α
    R_accept > 0.96              0.5
    0.8 < R_accept ≤ 0.96        0.9
    0.15 < R_accept ≤ 0.8        0.95
    R_accept ≤ 0.15              0.8

    Table 1. Temperature Update Schedule

    2.2. Pros and Cons

    The simulated annealing placer has several advantages. First, it outperforms the other placers wherever direct comparisons can be made [9]. The FPGA CAD tool VPR, which uses simulated annealing, has become the state-of-the-art tool in this field. Second, the simulated annealing placer has an open cost function, which can be defined as wire-length-driven, time-driven or path-driven. The cost function can also be a linear combination of these types, although it is hard to choose the weights. Third, its hill-climbing ability lets simulated annealing escape local minima on the way to a global minimum. However, simulated annealing is very slow because evaluating each move is computationally expensive and time consuming. In addition, because simulated annealing is inherently sequential, it is very hard to parallelize on multi-core CPUs or clusters.

    3 Quadratic Placement

    The quadratic placement method uses the squared wire length as its objective function and minimizes the cost by solving systems of linear equations [12]. Although quadratic placement considers only the squared wire length, it completes the placement process efficiently with almost no loss of quality. As a result, it is widely used in VLSI placement [12].

    3.1. Overview of Quadratic Placement

    The input to a quadratic placer is a hyper-graph netlist, and the process tries to minimize the total squared distance between every pair of connected nodes. The cost is computed according to the formula:

    \[ \Phi(x, y) = \frac{1}{2} \sum_{i=1, j=1}^{m} W_{i,j} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right] \]

    Here x and y are the coordinates of the logic blocks in the netlist, and $W_{i,j}$ is the weight between nodes $(x_i, y_i)$ and $(x_j, y_j)$. Since the input is a hyper-graph, in which a single net can connect more than two nodes and two nodes can be connected by more than one net, it must first be converted into a graph; there are two models for this conversion [7].
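
    One commonly used conversion is the clique model, in which every pair of blocks on a net is joined by a two-pin edge. The sketch below is an illustration only: whether the clique model is one of the two models meant here is an assumption, the edge weight 1/(k-1) for a k-pin net is just one common choice, and the name `clique_expand` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def clique_expand(nets: Iterable[List[str]]) -> Dict[Tuple[str, str], float]:
    """Convert a hyper-graph netlist into weighted two-pin edges.

    Weights accumulate when two blocks share more than one net, which gives
    the pairwise weights W[i, j] used by the quadratic objective.
    """
    weights: Dict[Tuple[str, str], float] = defaultdict(float)
    for pins in nets:
        unique = sorted(set(pins))
        k = len(unique)
        if k < 2:
            continue  # a single-pin net contributes no wire length
        w = 1.0 / (k - 1)  # assumption: a common clique-model weight
        for a, b in combinations(unique, 2):
            weights[(a, b)] += w
    return dict(weights)


# Hypothetical example: blocks a and b share a 3-pin net and a 2-pin net.
example_weights = clique_expand([["a", "b", "c"], ["a", "b"]])
```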

    The objective function can be rewritten in matrix notation:

    \[ \Phi(x, y) = \frac{1}{2} x^T Q x + d_x^T x + \frac{1}{2} y^T Q y + d_y^T y + \mathrm{const} \]

    where $Q$ is an $n \times n$ symmetric matrix and $d_x$, $d_y$ are n-dimensional vectors.

    Because the x and y terms have the same form and do not interact, the objective function can be separated into an x part and a y part that are minimized independently. With only the x dimension considered, the function becomes:

    \[ \Phi(x) = \frac{1}{2} x^T Q x + d_x^T x + \mathrm{const} \]

    To find the minimum, set $\nabla \Phi(x) = 0$, which results in the following matrix equation:

    \[ Q x + d_x = 0 \]

    This is the linear system whose solution minimizes the total squared wire length, and it can be solved using non-stationary iterative methods.
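
    As a worked sketch of this step, the code below assembles $Q$ and $d_x$ for a tiny one-dimensional example and solves $Qx = -d_x$ with the conjugate gradient method, which is one non-stationary iterative method. Several things here are assumptions: the split into movable blocks and fixed pads, the helper names, and the use of conjugate gradient itself, which is not confirmed to be the solver of [12]. Each undirected edge is counted twice in the double sum of the objective, which is why entries of $Q$ carry a factor of 2.

```python
from typing import List, Tuple


def build_system(
    n: int,
    movable_edges: List[Tuple[int, int, float]],  # (i, j, weight) between movable blocks
    fixed_edges: List[Tuple[int, float, float]],  # (i, fixed position, weight)
) -> Tuple[List[List[float]], List[float]]:
    """Assemble Q and d so that the cost is 1/2 x^T Q x + d^T x + const."""
    Q = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for i, j, w in movable_edges:
        Q[i][i] += 2 * w
        Q[j][j] += 2 * w
        Q[i][j] -= 2 * w
        Q[j][i] -= 2 * w
    for i, pos, w in fixed_edges:
        Q[i][i] += 2 * w
        d[i] -= 2 * w * pos
    return Q, d


def conjugate_gradient(Q, b, tol=1e-12, max_iter=1000) -> List[float]:
    """Solve Q x = b for a symmetric positive definite Q."""
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - Q x for the initial guess x = 0
    p = list(r)
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Qp = [sum(Q[i][j] * p[j] for j in range(n)) for i in range(n)]
        step = rs / sum(p[i] * Qp[i] for i in range(n))
        x = [x[i] + step * p[i] for i in range(n)]
        r = [r[i] - step * Qp[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x


# Hypothetical chain: pad at 0 -- block 0 -- block 1 -- pad at 3, unit weights.
Q, d = build_system(2, [(0, 1, 1.0)], [(0, 0.0, 1.0), (1, 3.0, 1.0)])
x_opt = conjugate_gradient(Q, [-v for v in d])
```

    In this example the two movable blocks settle at the evenly spaced positions 1 and 2 between the pads. Without any fixed pads, $Q$ would be singular and every block would collapse onto a single point, which is why fixed I/O positions matter in quadratic placement.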

    The algorithm proposed in [12] can be divided into three stages.

    In the first stage, a good initial placement is obtained by repeatedly building up, modifying and solving linear equations. This stage is performed until no significant improvement can be achieved.

    In the second stage, instead of building and solving linear equations, nodes are moved directly to reduce the total wire length, since the first stage has already given a reasonably good initial placement. The second stage is much faster than the first, so more iterations can be performed to obtain a better refinement.

    Finally, low-temperature simulated annealing can be used to refine the placement further.

    3.2. Pros and Cons

    The main advantage of the quadratic placement technique is that it significantly improves run time with almost no loss of quality compared to VPR. According to the results in [12], across the 20 MCNC benchmark circuits [5], the quadratic placer QPF runs 5.8 times faster than VPR on average, while the wire length obtained by QPF is only 1.9% longer than that of VPR. Using a better algebraic method to solve the linear equations might reduce the run time further. However, since the squared wire length is the only factor considered in the objective function, timing cannot be expressed in quadratic placement.

    4 Min-Cut Placement

    Partitioning-based placement algorithms are fast and hence scalable for large Application Specific Integrated Circuit (ASIC) placement, and they have also been applied to FPGAs. One recent partitioning-based placement method, named min-cut placement, recursively applies bi-partitioning to map the netlist of a circuit onto the FPGA layout region. It minimizes the number of nets that are cut while keeping highly connected logic blocks in the same partition [3].

    4.1. Overview of Min-Cut Placement

    Delay optimization is very important in circuit design. Effective delay minimization on large circuits is possible only by accounting for performance as early as possible in the design flow. Min-cut placement targets delay minimization in the placement stage, which is an early step in the design process.

    The min-cut placer employs the fundamental divide-and-conquer method. A circuit is recursively bi-partitioned in a breadth-first manner, as shown in Figure 2.

    Figure 2. Bi-partitioning Process of Min-Cut.

    The cut direction (horizontal or vertical) is decided based on the criticality of the nets crossing the four borders of the region, so that the total number of cuts is minimized [3]. This recursive process is repeated until each partition contains only a few blocks, which groups highly connected blocks together and thereby decreases the placement cost. The goal of min-cut is to find a partition that cuts the fewest wires in the netlist.
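
    The quantity being minimized at each bi-partitioning step can be made concrete with a short sketch. It counts the nets cut by a partition and, for a tiny netlist, finds the best balanced bi-partition by exhaustive search. The exhaustive search stands in for a real partitioner and is only feasible for a handful of blocks; it is not how hMetis works, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Set, Tuple


def cut_size(nets: Sequence[Sequence[str]], left: Set[str]) -> int:
    """Count the nets that have blocks on both sides of the partition."""
    count = 0
    for net in nets:
        pins = set(net)
        inside = len(pins & left)
        if 0 < inside < len(pins):
            count += 1
    return count


def best_balanced_bipartition(
    blocks: Sequence[str], nets: Sequence[Sequence[str]]
) -> Optional[Tuple[int, Set[str], Set[str]]]:
    """Try every balanced split and keep the first one with the fewest cut nets."""
    ordered = sorted(blocks)
    half = len(ordered) // 2
    best = None
    for left in combinations(ordered, half):
        size = cut_size(nets, set(left))
        if best is None or size < best[0]:
            best = (size, set(left), set(ordered) - set(left))
    return best


# Hypothetical netlist: a and b are tightly connected, as are c and d.
example_nets = [("a", "b"), ("a", "b"), ("c", "d"), ("b", "c")]
example_split = best_balanced_bipartition(["a", "b", "c", "d"], example_nets)
```

    The best split keeps the tightly connected pairs together and cuts only the single net between b and c. Applying the same step recursively to each half yields the partition tree described above.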

    All the edges in the netlist are weighted by timing criticality, as well as by the terminal alignment of critical nets [3]. The algorithm can be divided into three stages. In the first stage, min-cut uses the state-of-the-art multilevel partitioner hMetis [11] as its partitioning engine. During the partitioning process, a tight connection is maintained between the circuit graph and the placement, which represents the coordinates of all blocks on the FPGA fabric. Recursive partitioning continues until each leaf partition holds only a few blocks. In some cases a leaf partition may contain more blocks than it can accommodate, so overlaps must be removed. In the second stage, overlaps are removed with a greedy technique that moves blocks to the closest best-aligned partition. Finally, the placement is refined with low-temperature simulated annealing to further minimize the delay.

    4.2. Pros and Cons

    The advantage of the min-cut placement technique is that it minimizes delay in the placement stage, which lays the foundation for a better-performing circuit. In addition, the run times reported in [3] show an average 3-4x speedup over VPR on 20 MCNC benchmarks, with a slight degradation in quality. However, the results of min-cut depend on how well the partitioning is performed, and current research focuses on finding heuristics that partition the circuit better. Also, the min-cut placer may not reach the global minimum because of the greedy strategies it uses.

    5 Parallel Placement

    As the scale of modern FPGAs has reached millions of logic blocks, more efficient and scalable FPGA placement algorithms are needed. Parallelization is an appealing way to provide fast placement, given the rapid development of multi-core CPUs in recent years. The parallel approaches reviewed here are based on simulated annealing, since it outperforms the other methods and its main drawback is the time consumed by evaluating moves. We divide modern simulated-annealing-based parallel FPGA placers into three categories: the parallel move approach, the area based approach and the deterministic parallel approach.

    5.1. Parallel Move Approach

    Since a large number of moves are made at each temperature, the parallel move approach tries to accelerate simulated annealing by performing several moves at the same time. Each move has three possible outcomes: (i) two blocks are swapped, (ii) a block is moved to an empty location, or (iii) the move is rejected. Moves can be done in parallel only if they do not move the same block or move blocks to the same location. Figure 3 shows a simple example of parallel moves. Move 1 and Move 2 can be done in parallel since they are totally independent, while Move 2 and Move 3 cannot, because they try to move block 3 to different locations at the same time.

    However, these conditions only guarantee that there are no move collisions; net cost collisions can still happen. As shown in Figure 4, block 1 and block 3 belong to the same net. When move 1 and move 2 are done in parallel, the resulting bounding box of move 1 is the bounding box of blocks 2 and 3, while the resulting bounding box of move 2 is the bounding box of blocks 1 and 4. Two moves that move blocks of the same net may evaluate the bounding box incorrectly, because neither move can take into account that the other move is changing the bounding box.
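
    The two kinds of collision can be expressed as simple set tests. The sketch below is illustrative: the `Move` record, which lists the blocks a move relocates and the locations it touches, and the `nets_of` mapping from a block to its nets are hypothetical data structures, and the example moves are invented rather than taken from Figures 3 and 4.

```python
from typing import Dict, FrozenSet, NamedTuple, Set, Tuple


class Move(NamedTuple):
    blocks: FrozenSet[str]                  # blocks relocated by the move
    locations: FrozenSet[Tuple[int, int]]   # source and target locations


def move_collision(m1: Move, m2: Move) -> bool:
    """True if the moves relocate a common block or touch a common location."""
    return bool(m1.blocks & m2.blocks or m1.locations & m2.locations)


def net_cost_collision(m1: Move, m2: Move, nets_of: Dict[str, Set[str]]) -> bool:
    """True if the moves relocate blocks that belong to a common net."""
    nets1 = set().union(*(nets_of.get(b, set()) for b in m1.blocks))
    nets2 = set().union(*(nets_of.get(b, set()) for b in m2.blocks))
    return bool(nets1 & nets2)


# Hypothetical example: b1 and b3 share net n1.
nets_of = {"b1": {"n1"}, "b3": {"n1"}}
m1 = Move(frozenset({"b1"}), frozenset({(0, 0), (1, 0)}))
m2 = Move(frozenset({"b3"}), frozenset({(2, 2), (3, 2)}))
m3 = Move(frozenset({"b3"}), frozenset({(2, 2), (0, 3)}))
```

    Here `m1` and `m2` pass the move collision test but still share a net, so each would evaluate the bounding box of that net without seeing the other's change.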

    Figure 3. Parallel Moves Approach.

    Generally, there are two ways to deal with move collisions and net cost collisions.

    Ignore the errors in the cost function. This is the easiest way to deal with collisions, but it reduces the accuracy of the cost, which interferes with the acceptance of moves and adversely affects the results.

    Find disjoint moves that not only move different blocks but also belong to different nets. These over-restricted moves result in a smaller swap space, and the synchronization overhead tends to overwhelm the gain from parallelism.
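
    The second option can be sketched as a greedy filter that keeps a candidate move only if it shares no block and no net with the moves already selected. This is an illustrative assumption about how such a batch might be formed, not the procedure of [1]; moves are represented simply as tuples of the blocks they relocate, and the names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def select_disjoint_batch(
    moves: Sequence[Tuple[str, ...]], nets_of: Dict[str, Set[str]]
) -> List[Tuple[str, ...]]:
    """Greedily pick moves that share no block and no net with earlier picks."""
    used_blocks: Set[str] = set()
    used_nets: Set[str] = set()
    batch: List[Tuple[str, ...]] = []
    for blocks in moves:
        nets: Set[str] = set()
        for b in blocks:
            nets |= nets_of.get(b, set())
        if used_blocks.isdisjoint(blocks) and used_nets.isdisjoint(nets):
            batch.append(blocks)
            used_blocks |= set(blocks)
            used_nets |= nets
    return batch


# Hypothetical candidates: the second shares net n1, the fourth shares block b2.
nets_of = {
    "b1": {"n1"}, "b2": {"n2"}, "b3": {"n1"},
    "b5": {"n3"}, "b6": {"n4"}, "b7": {"n5"},
}
candidates = [("b1", "b2"), ("b3",), ("b5", "b6"), ("b2", "b7")]
batch = select_disjoint_batch(candidates, nets_of)
```

    Only half of the candidates survive even in this small example, which illustrates how the restriction shrinks the swap space.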

    Figure 4. Net Cost Collision.

    Both methods show negative speedups [1], because the overhead of synchronization outweighs the advantages of parallelization. Nevertheless, the idea of parallelizing moves has inspired many other parallel FPGA placement methods.

    5.2. Area Based Approach

    The area based approach is motivated by the collisions described for the parallel move approach. It partitions the area of the FPGA and assigns the partitioned areas to different processors. As shown in Figure 5, the whole circuit is partitioned into four parts, and each processor is in charge of one partition.

    Figure 5. Collision in Area Partitioning.

    The moves evaluated are much less restricted than in the parallel move approach. However, collisions can still happen, because multiple processors may move blocks that belong to the same net across partitions, as presented in Figure 5. For example, the bounding box of blocks 1, 2 and 3 cannot be computed accurately, since the blocks belong to different partitions. These errors can be tolerated because nets that span two or more partitions are not expected to occur very often. Moreover, as the temperature cools, swaps tend to happen between nearby blocks.
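
    A minimal sketch of the area partitioning follows. It maps each block to one region of a regular grid and lists the nets whose blocks fall into more than one region, which are the nets whose bounding boxes may be evaluated inexactly. The regular 2 x 2 grid, the coordinate convention and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def region_of(x: int, y: int, width: int, height: int, cols: int = 2, rows: int = 2) -> int:
    """Index of the grid region (one per processor) that contains (x, y)."""
    col = min(x * cols // width, cols - 1)
    row = min(y * rows // height, rows - 1)
    return row * cols + col


def spanning_nets(
    nets: Dict[str, List[str]],
    position: Dict[str, Tuple[int, int]],
    width: int,
    height: int,
    cols: int = 2,
    rows: int = 2,
) -> List[str]:
    """Names of the nets whose blocks lie in more than one region."""
    result = []
    for name, blocks in nets.items():
        regions = {
            region_of(position[b][0], position[b][1], width, height, cols, rows)
            for b in blocks
        }
        if len(regions) > 1:
            result.append(name)
    return result


# Hypothetical 8 x 8 fabric split into four regions.
position = {"b1": (1, 1), "b2": (6, 1), "b3": (2, 6), "b4": (1, 2)}
nets = {"n1": ["b1", "b2", "b3"], "n2": ["b1", "b4"]}
crossing = spanning_nets(nets, position, width=8, height=8)
```

    Net n1 spans three regions, so three processors could move its blocks at the same time, while net n2 stays inside one region and is evaluated exactly.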

    Since each processor can only move blocks within its own partitioned area, the partitioning must be performed carefully so that each block keeps the freedom to move to any location on the FPGA; otherwise the placement could not reach a global minimum. The area based approach uses both horizontal and vertical partitions to ensure that a global minimum can be reached [1].

    The experimental results show that a non-linear speedup is gained over the sequential placer and that the cost does not degrade as the number of processors increases. This is due to the lower synchronization requirements.

    5.3. Deterministic Parallel Approach

    One of the constraints on parallelism is the non-determinism of the results. This constraint was seldom studied in past work (an exception is [2]), but it is vital in a commercial context for the following two reasons [10].

    When a user of a commercial FPGA placement tool reports a bug, the problem must be reproducible. Non-determinism makes this extremely difficult because the results differ from run to run.

    In the release testing stage of building a placer, it would be very difficult to investigate failing tests if the results changed randomly.

    The algorithm proposed in [10] parallelizes the placement while keeping the results deterministic. The deterministic parallel approach splits a move into two stages: processing and finalization. As shown in Figure 6, during the processing stage each processor proposes a move and evaluates it. This takes the vast majority of the time and thus occurs in parallel. To avoid collisions and maintain determinism, the evaluated moves are put in a queue, and a dependency check ensures that there are no collisions and re-proposes moves that have collided. Note that the finalization can be done by any idle processor. In the example, C0 is idle when all the moves in the queue have been checked, so C0 does the finalization job.
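
    The two-stage structure can be illustrated with a small sketch built on Python threads. It is a simplification under several assumptions and not the algorithm of [10]: each move gets its own seed so that proposals do not depend on thread timing, moves are evaluated against a read-only snapshot, only location collisions are checked, collided moves are dropped rather than re-proposed, and only cost-reducing moves are accepted. All names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Placement = Dict[int, str]  # location -> block name


def propose_and_evaluate(seed: int, snapshot: Placement, cost: Callable[[Placement], float]):
    """Processing stage: propose one swap and evaluate it on the snapshot."""
    rng = random.Random(seed)  # per-move seed, independent of scheduling
    a, b = rng.sample(sorted(snapshot), 2)
    trial = dict(snapshot)
    trial[a], trial[b] = trial[b], trial[a]
    return a, b, cost(trial) - cost(snapshot)


def parallel_round(
    placement: Placement,
    seeds: Sequence[int],
    cost: Callable[[Placement], float],
    workers: int = 4,
) -> List[Tuple[int, int]]:
    """Evaluate moves in parallel, then finalize them in a fixed order."""
    snapshot = dict(placement)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of `seeds`, not completion order.
        results = list(pool.map(lambda s: propose_and_evaluate(s, snapshot, cost), seeds))
    touched = set()
    committed = []
    for a, b, delta in results:  # finalization stage
        if a in touched or b in touched:
            continue  # depends on an earlier committed move; dropped in this sketch
        if delta < 0:
            placement[a], placement[b] = placement[b], placement[a]
            touched.update((a, b))
            committed.append((a, b))
    return committed
```

    Because the proposals depend only on the seeds and the finalization order is fixed, a run with one worker and a run with four workers commit the same moves and end in the same placement.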

    Figure 6. Deterministic Parallel Approach.

    The deterministic parallel approach has several advantages. First, the speedup can be linear, assuming that the finalization time is negligible. Second, a move is now processed entirely by one processor, which improves memory locality. Third, the results are deterministic and serially equivalent.

    6 Future Work

    Algorithms for FPGA placement play a vital role in modern integrated circuit design. As the comparisons in this paper show, placement is still a bottleneck, even though tradeoffs can be made between quality and efficiency. Parallel methods still offer the potential to improve both run time and quality [10].

    Ideally, we want to systematically compare the results of

    each placement algorithm. However, direct comparisons of

    these algorithms are difficult, partly because of the limited

    access to the algorithms, and partly due to their different

    assumptions. Comparisons based on wire-length placement

    have been attempted in [5]. More work is needed to build

    a common framework to directly compare the performance

    of different FPGA placers.

    7 Conclusion

    In this paper, a number of different FPGA placement algorithms are reviewed, and a summary of the comparison is shown in Figure 7.

    Simulated Annealing
      Placement Quality: Overall, it gives the best results, particularly when using wire length as the cost function. The final placement result can reach global optimization.
      Placement Efficiency: It is very slow because of the computationally expensive evaluation of each move.

    Quadratic Placement
      Placement Quality: On average, it requires 1.9% more wire length compared to the simulated annealing method. It only considers the wire length factor in the cost function, while timing cannot be expressed.
      Placement Efficiency: Compared to simulated annealing, it can be 5.8 times faster on average.

    Min-Cut Placement
      Placement Quality: The final result varies dramatically based on how the partition is made. To the best of our knowledge, no direct comparison of results to simulated annealing has been made.
      Placement Efficiency: An average speedup of 3-4 times can be gained compared to simulated annealing.

    Parallel Placement
      Placement Quality: In general, using the deterministic parallel placement, the results are as good as those of simulated annealing.
      Placement Efficiency: The best speedup can be linear in the number of processors if the finalization time is negligible.

    Figure 7. Overall Comparison of the Four Placement Methods

    Simulated annealing placement in general outperforms the other placers with regard to the final results. In addition, it can escape local minima and has an open cost function. However, it is too time consuming. Quadratic placement gives fast run times, but its results cannot reach global optimization and timing cannot be expressed in the cost function. The min-cut method can also speed up the run time, but the quality of the placement still cannot be guaranteed. Parallel algorithms can give good speedups with almost no loss of quality, but their scalability is restricted by memory overhead.

    References

    [1] A. Nayak, A. Choudhary, M. Haldar and P. Banerjee. Parallel algorithms for FPGA placement. Proc. of the 10th Great Lakes Symposium on VLSI, pages 86-94, 2000.

    [2] J. Chandy, S. Kim, B. Rankumar, S. Parkers and P. Banerjee. An evaluation of parallel simulated annealing strategies with application to standard cell placement. TCAD, 16:398-410, 1997.

    [3] P. Maidee, C. Ababei and K. Bazargan. Time-driven partitioning-based placement for island style FPGAs. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 24:395-406, 2005.

    [4] J. Cong, D. Chen and P. Pan. FPGA design automation: A survey. Foundations and Trends in Electronic Design Automation, 1, 2006.

    [5] S. Yildiz, M. Markov, I. Villarrubia, P. Parakh and Madden. Benchmarking for large scale placement and beyond. Proc. of the International Symposium on Physical Design, pages 95-103, 2003.

    [6] V. Betz, J. Rose and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic, 1999.

    [7] N. Viswanathan and C. Chu. FastPlace: Efficient analytical placement using cell shifting, iterative local refinement and a hybrid net model. Proc. of ISPD, 2004.

    [8] S. Brown and J. Rose. FPGA and CPLD architectures: A tutorial. IEEE Design and Test of Computers, 12:42-57, 1996.

    [9] V. Betz and J. Rose. VPR: A new packing, placement and routing tool for FPGA research. International Workshop on Field Programmable Logic and Applications, pages 213-222, 1997.

    [10] A. Ludwin, V. Betz and K. Padalia. High-quality, deterministic parallel placement for FPGAs on commodity hardware. ACM/SIGDA Int. Symp. on FPGAs, pages 14-23, 2008.

    [11] R. Aggarwal, V. Kumar, G. Karypis and S. Shekhar. Multilevel hypergraph partitioning: Application in VLSI domain. Proc. ACM/IEEE DAC, 1997.

    [12] Y. Xu and M.A.S. Khalid. QPF: Efficient quadratic placement for FPGAs. International Conference on Field Programmable Logic and Application, pages 555-558, 2005.
