High Performance FPGA Designs
Kartik Subramanian Iyer
Nusrat Ali
Date: Nov-03-2006
Copyright Notice
This document contains proprietary information of HCL Technologies Ltd. No part of this document
may be reproduced, stored, copied, or transmitted in any form or by means of electronic, mechanical,
photocopying or otherwise, without the express consent of HCL Technologies. This document is
intended for internal circulation only and not meant for external distribution.
Table of Contents
1. Introduction
2. Coding Guidelines for Good Synthesis results
2.1. Identify critical Blocks
2.2. Limiting levels of logic
2.3. Multiple Clocks Design and Clock Enable
2.4. Single clock edge to clock data
2.5. Registered outputs from each leaf-level block
2.6. Reset Strategy
2.7. Proper partition
2.8. Design for Testability
2.9. Resources Used
2.10. FIFO Uses
2.11. Core Generator
2.12. Xilinx specific components
2.13. Device architecture
3. Core-Gen/Third Party IP Integration in Synthesis
3.1. Core-Gen FIFO IP core support in Synplify Pro
3.2. Xilinx implementation perspective
4. Analyzing Timing Reports
5. Implementation Options and Guidelines for ISE
5.1. Translate Properties
5.2. Map Properties
5.3. Place & Route Properties
5.4. Multi-Pass Place-and-Route
6. Guideline for MAP & PAR Options
7. Guideline for Placement using Floor-Planner
7.1. Area grouping constraints
8. Example of Pin Locking Constraints
9. Common P&R & Map Errors
10. Plan Ahead advantage
10.1. PlanAhead Flow
1. Introduction
This paper shares guidelines and tips for writing high-performance FPGA designs. It also shares
the authors' experience of designing a high-performance DDR2 controller IP. The paper covers all
stages of an FPGA design, from RTL coding through Map and Place & Route.
2. Coding Guidelines for Good Synthesis results
Following are some RTL coding guidelines for achieving high performance in FPGA.
2.1. Identify critical Blocks
The familiar 80-20 rule applies here as well. In most cases it is about 20% of the design
(blocks) that fails timing and creates problems for the complete design. These blocks should
be identified when creating the design document. Most of the time these blocks are counters,
state machines, decoding logic, and data path logic. Always register their outputs before using
them in some other block/logic.
2.2. Limiting levels of logic
The designer should have a rough idea of the number of logic levels the design can tolerate while
still achieving the desired frequency.
Tip: Around 5-8 levels of logic can achieve 250 MHz in Virtex-4.
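One common way to stay within a logic-level budget is to split a long combinational path across
register stages. The sketch below is illustrative only (the module and signal names are
hypothetical and are not taken from the DDR2 controller). It sums four operands in two pipeline
stages so that each stage contains a single adder level, at the cost of one extra clock of latency.

```verilog
// Hypothetical example: sum of four 16-bit operands, pipelined in two stages.
module pipelined_sum4 (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [17:0] sum
);
    reg [16:0] sum_ab, sum_cd;        // stage 1 partial sums (17 bits keep the carry)

    always @(posedge clk) begin
        sum_ab <= a + b;              // stage 1: one adder level
        sum_cd <= c + d;              // stage 1: one adder level
        sum    <= sum_ab + sum_cd;    // stage 2: uses the values registered one clock earlier
    end
endmodule
```

The result appears two clocks after the operands are presented, so the surrounding logic has to
account for the added latency.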
2.3. Multiple Clocks Design and Clock Enable
The design should be partitioned so that all clock-domain-crossing logic is contained in a single
block. This also reduces the effort needed to create the timing constraints.
Use clock enables instead of gated clocks. Using a clock enable saves clock resources and can
improve the timing characteristics and analysis of the design.
To gate entire clock domains for power reduction, it is preferable to use the clock-enabled global
buffer resource (BUFGCE), whereas for applications that only attempt to pause the clock for a
few cycles on small areas of the design, the preferred method is to use the clock-enable pin of the
FPGA register.
Not suggested coding style:
assign GATECLK = (IN1 & IN2 & CLK);
always @(posedge GATECLK)
begin
  if (LOAD)
    OUT1 <= DATA;
end
Suggested coding style:
assign ENABLE = (IN1 & IN2 & LOAD);
always @(posedge CLOCK)
begin
  if (ENABLE)
    DOUT <= DATA;
end
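For the other case, gating an entire clock domain, a minimal sketch using the BUFGCE primitive
might look as follows. It assumes the Xilinx unisim library is available to the simulator and the
synthesis tool; the wrapper module and its port names are hypothetical.

```verilog
// Hypothetical wrapper: gates a whole clock domain with a clock-enabled global buffer.
// BUFGCE comes from the Xilinx unisim library.
module gated_domain_clk (
    input  wire clk_in,      // free-running source clock
    input  wire domain_en,   // 1 = domain clock runs, 0 = domain clock is held
    output wire clk_domain   // clock distributed to the gated domain
);
    BUFGCE u_bufgce (
        .I  (clk_in),
        .CE (domain_en),
        .O  (clk_domain)
    );
endmodule
```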
2.4. Single clock edge to clock data
Using both clock edges can make timing hard to meet, especially when there is logic between a
flop clocked on one edge and a flop clocked on the other, and the two flops are placed far apart.
2.5. Registered outputs from each leaf-level block
The outputs of each major partition block, state machine and FIFO should be registered. This is
quite helpful when modifying the design for better performance. It also provides flexibility in
routing and floor planning.
2.6. Reset Strategy
Synchronous resets in Xilinx devices allow better performance. Asynchronous or unnecessary resets
hurt performance because they:
• Prevent the use of synchronous elements of dedicated hardware blocks
Example: Both multiplier blocks and RAM registers contain only synchronous resets. If an
asynchronous reset is coded for these functions, the registers within these blocks cannot be used. This
has a severe effect on performance.
• Prevent optimizations of the logic inside the fabric
• Severely constrain placement and routing, because reset signals often have high fanout
• Prevent the use of a device library component, such as the shift register look-up table (SRL)
Example: A reset cannot be described in the code when inferring performance-optimized shift
registers (SRL), because the SRL library component does not have a reset. Using resets in
code that infers shift registers requires either several flip-flops or additional logic around the
SRL to allow a reset function.
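The SRL point can be sketched as follows. This is an illustrative shift register with no reset,
written so that a synthesis tool is free to map it onto SRL resources; whether it actually does
depends on the tool and its settings. The module name and the DEPTH parameter are hypothetical.

```verilog
// Hypothetical example: shift register coded without a reset so it can map to an SRL.
module delay_srl #(
    parameter DEPTH = 16              // number of clock-enabled delay stages (must be >= 2)
) (
    input  wire clk,
    input  wire ce,                   // clock enable, supported directly by the SRL
    input  wire din,
    output wire dout
);
    reg [DEPTH-1:0] shreg;

    // No reset branch: adding one would force discrete flip-flops or extra logic.
    always @(posedge clk)
        if (ce)
            shreg <= {shreg[DEPTH-2:0], din};

    assign dout = shreg[DEPTH-1];
endmodule
```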
2.7. Proper partition
The synthesis tools do not optimize logic across hierarchy boundaries as efficiently as they
optimize logic within one hierarchy. Moreover, designers often want to retain the hierarchy
to help them understand the implementation better. Here are a few guidelines that will greatly help
in achieving high performance.
1. Partition the logic based on interaction. As a general guideline, keep no more than 300 lines
of code in one partition, and register both the outputs and the inputs. If that is not possible,
always at least register the outputs.
2. Keep related logic together for better optimization, especially if you are using FIFOs,
state machines, etc. Make sure that all the related logic is in one block. This helps you
to route the block independently.
3. Place all I/O components including any instantiated I/O buffers, registers, DDR circuitry,
SerDes, or delay elements on the top-level of the design. If it is not possible to place them on
the top-level, ensure that they are all contained within a single hierarchy.
4. Any logic in which the synthesis tool employs resource sharing should be contained within
the same hierarchy.
5. Manually duplicate registers with high fan-outs at hierarchy boundaries.
Tip: Avoid glue logic at the top level
2.8. Design for Testability
• Avoid tri-state buses, as there is a limited number of tri-state resources available in an FPGA.
If you have to use tri-state buses, then to ensure testability pass the enable of the tri-state bus
through an AND gate so that the scan_enable signal can control the tri-state bus.
• Use multiplexer logic at the output of a derived clock before it is fed to the clock input of
another flip-flop. Make the other input of the mux the primary clock and the select line
"scan_enable". This makes sure that the primary clock is used during test.
• In power-saving/gated-clock designs, add an OR gate after the AND gate and add
scan_enable as another input to the OR gate in addition to the output of the AND gate.
• In case of a derived/internally generated reset, pass it through an OR gate with "scan_enable"
as the other input. In test mode, asserting "scan_enable" makes sure that the
asynchronous reset is disabled, to avoid losing any data in scan mode.
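The derived-clock and derived-reset guidelines above can be sketched as below. This is an
illustration under stated assumptions: the internally generated reset is taken to be active-low,
and all module and signal names are hypothetical. In a real FPGA design a dedicated clock
multiplexer resource may be preferable to a fabric mux on a clock.

```verilog
// Hypothetical DFT helper: scan_enable selects the primary clock and blocks the derived reset.
module dft_clk_rst_ctrl (
    input  wire primary_clk,
    input  wire derived_clk,     // internally generated clock
    input  wire derived_rst_n,   // internally generated reset, assumed active-low
    input  wire scan_enable,     // 1 = test mode
    output wire ff_clk,          // clock seen by the downstream flip-flops
    output wire ff_rst_n         // reset seen by the downstream flip-flops
);
    // In test mode the flip-flops are clocked by the primary clock.
    assign ff_clk   = scan_enable ? primary_clk : derived_clk;

    // In test mode the OR gate holds the active-low reset inactive.
    assign ff_rst_n = derived_rst_n | scan_enable;
endmodule
```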
2.9. Resources Used
Always think about the resource requirements, routing issues and frequency requirements before
coding the logic.
Let us say there is a requirement to model grant logic where the grant has to be provided
8 clocks after the request and there can be at most 4 outstanding requests at a time. This could be
easily modeled using a shift register. When the same delay reaches 64, you may want to use the
dedicated SRLs available in Virtex devices, as they use the resources efficiently.
When the same delay reaches 256 or more, the SRL implementation may not be suitable: it
occupies many LUTs and may increase the routing delay of other related logic, which
gets spread far across the LUTs. A better approach is to use a FIFO.
2.10. FIFO Uses
Always register the FIFO outputs, i.e. the FIFO empty and FIFO full signals. If they are not
registered and are used by logic in another RTL block that is placed far away from the block
containing the FIFO, the routing delay can have a serious impact on timing.
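A minimal sketch of registering the flags is shown below; the module and signal names are
hypothetical. Note that registered flags lag the FIFO state by one clock, so the consuming logic
must tolerate that delay (for example by acting on an almost-full indication rather than full).

```verilog
// Hypothetical example: register FIFO status flags before sending them to a distant block.
module fifo_flag_reg (
    input  wire clk,
    input  wire fifo_empty,     // raw flag from the FIFO
    input  wire fifo_full,      // raw flag from the FIFO
    output reg  fifo_empty_r,   // registered copy, one clock late
    output reg  fifo_full_r     // registered copy, one clock late
);
    always @(posedge clk) begin
        fifo_empty_r <= fifo_empty;
        fifo_full_r  <= fifo_full;
    end
endmodule
```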
2.11. Core Generator
The Xilinx CORE Generator™ tool comes with many basic cores which are quite useful and
timing efficient. These cores could be considered in the following situations.
• The synthesis tool is not inferring the proper resources.
• Synthesis does not meet the timing/area requirements.
• Ready-to-use, proven cores are needed to save engineering time and money.
2.12. Xilinx specific components
Xilinx provides some ready-made components. Make use of the available components for better
device utilization and performance. Ensure that you include these library files during the
synthesis stage along with the HDL code of your design.
For example, some components for a DDR design are glbl, IDDR, IDELAY, IDELAYCTRL,
IOBUF, and ODDR. The designer should have some idea of the setup requirements of the FPGA
primitives/components used. For example, the setup requirements for DDR registers are quite
high.
Tip: Register the outputs and inputs to the third party IP and Xilinx components.
2.13. Device architecture
Always keep the device structure in mind before coding the RTL logic. Virtex-4 has a column
architecture where the FIFOs and block RAMs are arranged in columns. Knowledge of the
architecture helps in using the resources efficiently and achieving performance.
Figure 1: Virtex 4 Device Architecture
Example: If an RTL block interacts with multiple FIFOs and also interacts with the I/O pins,
then the routing delays for a RAM/FIFO located far away from the I/O pins will be larger. In
this scenario you may want to reconsider the decision to use a FIFO if the FIFO size is
small. The figure below explains the issue.
Figure 2: Routing delay when placed FIFO is far apart (blocks shown: BUFR/BUFIO, slice logic,
FIFO/block RAM, DSP, IDELAYCTRL, ODDR/IDDR)
Figure 2 shows a scenario in our DDR2 design where the ODDR/IDDR pins were locked and the
nearby block RAMs were occupied, causing the desired block RAM to be placed far away, with
significant routing delay.
The following changes helped us achieve the frequency.
The desired FIFO (placed far apart) was small in our case, so migrating it to distributed RAM
helped us. The dedicated I/O registers causing timing violations were moved from the I/O to the
FPGA fabric. To place the registers manually, the following steps must be performed.
• Disable any global I/O register placement options for the synthesis tool
• Specify whether the register should be placed into the I/O by adding an IOB=TRUE in
the UCF file or source HDL code
• Disable the Map option "Pack I/O Registers/Latches into IOBs" in ISE Project Navigator.
This disables automatic pushing of registers into the I/O
Tip: Disable global packing of registers into I/O cells. Instead, only constrain registers for which timing
is critical on the printed circuit board to be packed into the FPGA I/O cell.
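As an illustration of the per-register approach, the IOB property can be attached in the HDL
source. The sketch below assumes XST-style Verilog-2001 attribute syntax (other synthesis tools
use their own attribute names), and all module and signal names are hypothetical.

```verilog
// Hypothetical example: choose per register whether it may be packed into the I/O cell.
// Assumes XST-style attributes; global I/O register packing is assumed to be disabled.
module io_reg_select (
    input  wire clk,
    input  wire fast_pad_in,    // input whose board-level timing is critical
    input  wire slow_pad_in,    // input with relaxed board-level timing
    output wire fast_q,
    output wire slow_q
);
    (* IOB = "TRUE"  *) reg fast_in_r;   // request packing into the IOB
    (* IOB = "FALSE" *) reg slow_in_r;   // keep in the fabric so the placer can move it

    always @(posedge clk) begin
        fast_in_r <= fast_pad_in;
        slow_in_r <= slow_pad_in;
    end

    assign fast_q = fast_in_r;
    assign slow_q = slow_in_r;
endmodule
```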
3. Core-Gen/Third Party IP Integration in Synthesis
3.1. Core-Gen FIFO IP core support in Synplify Pro
Xilinx Core Generator creates structural EDIF net lists (Xilinx/EDIF version 2.00) with both .ndf
and .edn filename extensions. The .edn and .ndf net list files are used in the ISE translate
stage.
If the design uses Core-Gen FIFOs, they can be declared as black boxes and the .edn
and .ndf net list files can be used in the translate stage. The Synplify tool can also read the
EDIF-formatted files generated by the Xilinx Core Generator to see the black box contents.
Note: Where part of the design is available only in net list format, the synthesis tool will not
be able to optimize the interface as efficiently, though it will use the .ndf and .edn files when
generating the timing report.
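A black-box declaration for such a FIFO might be sketched as follows, using the Synplify
syn_black_box directive. The module name, port names and widths here are hypothetical; they must
match the net list actually generated by Core Generator.

```verilog
// Hypothetical black-box stub for a Core-Gen FIFO (ports must match the generated core).
// The body is intentionally empty; the contents come from the .edn/.ndf net lists.
module addr_fifo (
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] din,
    input  wire        wr_en,
    input  wire        rd_en,
    output wire [31:0] dout,
    output wire        full,
    output wire        empty
) /* synthesis syn_black_box */;
endmodule
```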
3.2. Xilinx implementation perspective
Most of the cores generated by the Core-Gen tool are a combination of .edn and .ndf files. An
.ndf file is a Xilinx binary file equivalent to an .edf file; it has only LUT functionality
(conveys only resource and timing information) and can only be read by Xilinx software, whereas
the .edn file has complete functionality (used both for logic implementation and for
communicating resource and timing information).
During the translate stage of the design, all the net list files generated by the Core-Gen tool for
the IP core need to be added to the project file of the design.
Note: These include the files with extensions .edn and .ndf, and any other lower-level net list
files in the hierarchy. Failure to add any one of these files will result in implementation errors
during the translate stage.
4. Analyzing Timing Reports
The Logic Level Timing Report gives the first measure of the design performance. Following are
the important points to consider.
• Allow an extra 20% margin on the delays reported by synthesis, since routing
delays are only estimated at that stage.
• Do a simple synthesis with just the clock constraints and observe the results. If the design is
far from its goal at this stage, the compile time is going to be large and the Implementation
Tools may not be able to reach your timing goal (be very aware of this).
• Use the Post Layout Timing Report to verify that your constraints were met by the
Implementation Tools. This is easier than opening the Timing Analyzer on a very large
Virtex design, which might take a couple minutes.
• Use the Timing Analyzer to generate detailed timing information about your design. The
Timing Analyzer will provide a wealth of timing information on designs that use timing
constraints. Unconstrained designs will generate a Default Path Analysis that is only slightly
helpful.
• The Report Paths in Timing Constraints report shows each constraint's delay paths in
descending order of slack. The Report Paths Failing Timing Constraints report shows each
failing delay path.
• The Custom Report shows all the delay paths between groups of path endpoints created by
selecting Sources and Destinations. This report can be used to find the timing information for a
particular delay path without having to review a large report.
• The Report Paths Not Covered report shows all of the delay paths in the design, in
descending order of length. This report can be used to find any unconstrained delay paths.
• The Timing Analyzer reports can show users how many levels of logic are being inferred.
This is very important, since most designers are not aware of how much logic they are
generating with their synthesis tool, or how much optimization the synthesis tool is doing for
them. If your delay path infers multiple levels of logic, it will have to be re-synthesized (with
code changes or different synthesis option settings) to meet your timing objective.
• Hide the unwanted messages reported in the ISE timing report by setting filters in ISE
5. Implementation Options and Guidelines for ISE
5.1. Translate Properties
Make sure that the LOC Constraints box is enabled if you already have some LOC constraints in
the UCF file.
5.2. Map Properties
• Timing-Driven Packing and Placement: The timing-driven packing option uses the timing
constraints to guide the packing of critical path logic into slices. It ensures that the critical
paths are placed and routed before other non-critical paths.
Tip: Try timing-driven packing with a regular PAR effort level of High first; then try the
extra-effort level, starting with Normal.
• Map Effort Level: It is better to start from the standard effort level. If the design does not
meet the timing requirements, then try a High map effort with a Normal extra effort.
• Combinatorial Logic Optimization: Enable this option to remove any extra (unused)
combinatorial logic.
• Register Duplication & Global Optimization: Be careful with these options, as you may
over-constrain the implementation tools into performing a lot of actions on the design, and the
implementation tools may issue an error.
• Replicate Logic to Allow Logic Level Reduction: Register replication increases the speed of
critical paths by making copies of registers to reduce the fan-out of a given signal. Enable this
option to potentially improve timing results. Manual replication can also be tried if the tool is not
able to replicate the logic. Replication increases the area.
Tip: Enable this option when high-fanout nets with long route delays are reported as critical
paths in the timing report.
Manual replication of High Fan out Net
(* EQUIVALENT_REGISTER_REMOVAL="NO" *) reg signal_1, signal_2;
// Two copies of the same high fan-out source, each driving half of the load
always @(posedge clk)
begin
  signal_1 <= signal1_high_fan_out;
  signal_2 <= signal1_high_fan_out;
end
always @(posedge clk)
begin
  if (signal_1) data_out[7:0] <= data[7:0];
  if (signal_2) data_out[15:8] <= data[15:8];
end
Note: Many times an additional synthesis constraint needs to be added to ensure that a manually
duplicated register is not optimized away by the synthesis tool. In the above example, the XST syntax
was used (EQUIVALENT_REGISTER_REMOVAL).
5.3. Place & Route Properties
• Place & Route Mode: The default value is Standard. Standard gives the fastest run time but
the least effort in meeting your timing objectives. High gives the most effort at meeting your
timing objectives at the expense of increased run time. Try effort level Standard, then Medium,
then High as the final choice.
• Placer Effort Level: For shorter run times of the tool, it is better to choose a Medium effort
level.
• Router Effort Level: The router effort level can be increased from Medium to High. This
improves timing when there is a significant number of timing violations caused by
routing delays, typically by about 5%.
Tip: Routing and timing are largely based on the placement of logic. Therefore, it is
usually most beneficial to use a High effort level for placement and limit the routing effort
level to Standard. While the quality of the routing is based on the placement, the best
placement will not always produce the best timing.
• Extra Effort: It is better not to choose this option unless you have tried all possible
implementation strategies and are still not able to meet your timing objectives. Enabling this
option results in significantly longer run times of the tool. It may also result in P&R errors
if you have already enabled other implementation options, such as timing-driven packing and
placement during the MAP stage.
5.4. Multi-Pass Place-and-Route
Multi-Pass Place-and-Route (MPPR) is the part of the Xilinx tool set that fully implements the
design based on a cost table (often referred to as a "seed"). 100 different cost tables can be
attempted. Each one provides a fully implemented design with a different placement (and
different routing), and therefore different timing.
• Multi-Pass Place-and-Route is a very time-consuming task and should be used only
after nearly all other options have been attempted.
Tip: Start with a low placer cost table value (preferably the default) and a small value for the
Number of PAR Iterations (about 3 to 5). For example, target 5 iterations of PAR in MPPR
mode with the option of saving the results from the 3 best runs. This should give you a fair
idea of the improvement in timing achieved through each run.
• For most designs, you can expect a 15 to 20% difference in performance between the very
best and the very worst cost tables. Typically, you might gain a 5% improvement over a
normal place and route.
• With reference to the points on placement in the previous section, you can
run many different placements and, once you have found the best placement, increase
the routing effort level to High to finish the routing on the best one or two placements.
6. Guideline for MAP & PAR Options
1. Ensure that the clocks are routed on global/dedicated clock resources. This reduces clock skew,
which minimizes the possibility of hold violations and increases the reliability of the design.
2. Route the Reset and Set signals on the global routing resources. Use the Global Set/Reset
(GSR) resources to reduce the skew on a set/reset in older device families. Don't use the GSR
in Virtex. The GSR has too much delay and general interconnect will distribute this signal quickly.
3. Provide a proper Max Skew attribute in the UCF for control signals that are routed on general
interconnect and have high fan-out.
4. For both placement and routing, the designer has to be aware of the contents of the placed
(routed) block, else placement (routing) will fail. Always keep a 20% margin of logic elements
for any design to be placed (routed); this reduces the tool run time and increases the chances of
a successful route.
5. At times the logic being developed is intermediate (part of a bigger design). The following
points need to be taken care of in such cases.
6. Generally the I/O paths show large delays which will not occur in the real design, where the
input could be the registered output of some other block. In such cases the path from I/O to the
first register can be declared as a false path.
7. The intermediate design may have more I/Os than are available in the device. Place and Route
will fail in such cases. Design a wrapper in which all the inputs are registered and fed to a mux.
The select line of the mux is generated by a free-running counter, and the output of the mux is
fed to the real design. This ensures that the wrapper is not synthesized away. Declare the I/O to
register paths as false paths.
Figure 3 : Wrapper for ISE implementation
7. Guideline for Placement using Floor-Planner
The location of the block to be placed can be specified in terms of X and Y coordinates on the
FPGA device, either by entering the coordinates manually or by dragging the placed block in the
device editor and letting the tool generate the slice coordinates automatically. In doing so, ensure
that the area covered by the coordinates contains all the logic resources required for the block. It
is good to have a 20% margin.
A rough resource estimate for the block to be placed can be taken from the Map report (.mrp), and
a rough estimate of the resources contained in the chosen area can be obtained by looking at the
device architecture and the percentage of the device area covered by the coordinates.
Tip: Specify the coordinates diagonally, e.g. take X0Y0 as one corner and X10Y10 as the other.
This ensures that all FPGA resources (slices/FIFOs/RAMs/DSPs) within this area are
available to the placer for placing the logic of the design.
Example: INST "u_mem_intf" RANGE = SLICE_X0Y247:SLICE_X80Y167;
Note: The drawback of this approach is the approximation involved in choosing the coordinates.
The approximation becomes more complex when common logic overlaps
between two blocks.
7.1. Area grouping constraints
The preferred method of placing related logic on an FPGA is to use Area Grouping constraints.
If an Area Group is attached to a hierarchical block, all sub-blocks in the block are assigned to
the group. Once defined, an AREA_GROUP can have a variety of additional constraints associated
with it to control its implementation. All these AREA_GROUP constraints should be specified in
the UCF file.
Example:
AREA_GROUP "AG_mem_intf_grp" GROUP = OPEN;
AREA_GROUP "AG_mem_intf_grp" PLACE = OPEN;
AREA_GROUP "AG_mem_intf_grp" RANGE = SLICE_X0Y247:SLICE_X80Y167;
AREA_GROUP "AG_mem_intf_grp" RANGE = RAMB16_X0Y21:RAMB16_X0Y30,
RAMB16_X1Y21:RAMB16_X1Y30, RAMB16_X2Y21:RAMB16_X2Y30;
INST "u_ddr_mmr" AREA_GROUP = "AG_mem_intf_grp";
INST "u_ddr_controller" AREA_GROUP = "AG_mem_intf_grp";
INST "u_alg_dmapio_mux" AREA_GROUP = "AG_mem_intf_grp";
INST "u_alg_dmapio_arbiter" AREA_GROUP = "AG_mem_intf_grp";
INST "u_alg_synchronizer" AREA_GROUP = "AG_mem_intf_grp";
INST "u_addr_fifo" AREA_GROUP = "AG_mem_intf_grp";
As can be seen from the example above the RANGE for resource usage can be set.
Tip: The AREA_GROUP constraints GROUP and PLACE control whether logic outside the
AREA_GROUP may be packed and placed together with the logic inside the AREA_GROUP. To
allow this, set them to OPEN.
8. Example of Pin Locking Constraints
There are three possible ways in which pin locking can be done.
1. The pin-locking LOC constraints can be specified within the UCF (User Constraints File). Here
the user has to write the desired coordinates in the UCF file manually.
2. The pin locking can be done through the PACE editor, where the tool itself generates the
coordinates for the placed block.
3. The pin locking can be done through the design browser available with the PlanAhead software
suite. PlanAhead is not part of ISE and requires a separate license. The tool also runs
built-in DRC checks.
Examples:
IDELAY Control related pin locking constraints
INST "u_ddr_controller/ddr_idelayctrl_0" LOC=IDELAYCTRL_X0Y0;
BUFG related pin locking constraints
INST "u_ddr_controller/u_BUFG_IDELAYCTRL" LOC=BUFGCTRL_X0Y0;
ODDR related pin locking constraints
NET "ddr2_dq_out[71]" LOC = "T36";
IDDR related pin locking constraints
NET "MEM_DM[8]" LOC = "M37";
Here MEM_DM is an inout type of port.
9. Common P&R & Map Errors
IDELAYCTRL Uses: When only one IDELAYCTRL is instantiated in the HDL design code, LOC
constraints are not required, but when multiple IDELAYCTRLs are instantiated, LOC constraints
are required, else the tool will report an error. The reference clock (REFCLK) port of the
IDELAYCTRL should be driven by a global clock buffer (BUFGCTRL), else the tool will report
an error.
Tip: Always provide a LOC constraint for the IDELAYCTRL primitive, even if the design uses only
one IDELAYCTRL. It will decrease the power consumption and resource area.
ODDR Uses: For outputs with bit widths greater than one that are driven by ODDR instances,
we must ensure that each bit of the output is driven by a separate ODDR instance.
ODDR use is explained below.
Wrong Code
output [`DDR2DS_WIDTH -1 :0] ddr2_dqs_out ;
//---------------------------- Internal Wire Declarations --------------------------------------
wire [`DDR2DS_WIDTH -1 :0] ddr2_dqs_out ;
wire [`DDR2DS_WIDTH -1 :0] ddr2_dqs_out_int /* synthesis syn_keep=1 */ ;
reg dqs_en_in_reg /* synthesis syn_preserve=1 */ ;
wire mem_dqs_in /* synthesis syn_keep=1 */ ;
ODDR #("SAME_EDGE",0,"SYNC") U_dqs0_oddr (
.C ( MEM_CLK_in ), // in
.CE ( dqs_en_in_reg ), // in
.R ( 1'b0 ), // in
.S ( 1'b0 ), // in
.D1 (1'b1 ), // in
.D2 (1'b0 ), // in
.Q ( mem_dqs_in ) // out
) ;
assign ddr2_dqs_out_int = mem_dqs_in ;
assign ddr2_dqs_out = ddr2_dqs_out_int ;
always @ (posedge core_clk_in )
begin
dqs_en_in_reg <= dintrf_memintrf_wrdat_en_out;
end
In the above code the output of a single ODDR viz. mem_dqs_in is being used to drive a vectored
output port. Logically this is not possible as it amounts to packing multiple outputs to a single
ODDR. Under these circumstances the implementation tools will issue an error.
Correct Code
reg dqs_en_in_reg /* synthesis syn_preserve=1 */ ;
wire [`DDR2DS_WIDTH - 1:0] mem_dqs_in /* synthesis syn_keep=1 */ ;
genvar i;
generate for (i=0; i < `DDR2DS_WIDTH ; i= i+1)
begin : dqs_Test
ODDR #("SAME_EDGE",0,"SYNC") U_dqs0_oddr (
.C ( MEM_CLK_in ), // in
.CE ( dqs_en_in_reg ), // in
.R ( 1'b0 ), // in
.S ( 1'b0 ), // in
.D1 (1'b1 ), // in
.D2 (1'b0 ), // in
.Q ( mem_dqs_in[i] ) // out
) ;
end
endgenerate
assign ddr2_dqs_out_int = mem_dqs_in ;
assign ddr2_dqs_out = ddr2_dqs_out_int ;
always @ (posedge core_clk_in )
begin
dqs_en_in_reg <= dintrf_memintrf_wrdat_en_out;
end
In the code above, this is achieved by using a for loop within a generate statement. Note
that the mem_dqs_in signal is now declared as a vectored signal.
Note: When IDELAYCTRL is instantiated without LOC constraints, the implementation tools
automatically replicate IDELAYCTRL instances throughout the entire device, even in clock regions
not using the delay element. This results in higher power consumption due to higher resource
utilization (one global clock resource is used in every clock region) and a greater use of routing
resources. There are eight global clock lines per clock region.
10. Plan Ahead advantage
Following are the major challenges (limitations) of the ISE Place and Route Tool.
1- The user has to make a rough approximation of the resources required for the block placed.
2- The user has to also make an approximation of the total area (logic) where the desired block is
being routed and has to insure that the routed area can accommodate the routed logic.
3- One of the major disadvantages with the ISE place and route engine is that when you have
designs with overlapping logic and you try assigning area grouping constraints to such designs
then invariably the design fails at the MAP stage during the implementation-phase. The area for
placement (co-ordinates) has to be increased further in such cases.
PlanAhead provides a solution to the above-mentioned issues. It improves the performance of the design
by reducing route delay through floor-planning. It provides deep insight into routing issues and
allows the user to decide on the appropriate placement of the logic.
It can hierarchically partition the design into smaller, more manageable physical blocks (called
Pblocks). It maintains a physical hierarchy that is independent of the logical hierarchy. This enables
Pblocks to include logic modules and primitive logic from anywhere in the logical hierarchy. Critical
or associated logic can be tightly grouped together into a single Pblock, which prevents logic
migration and thus limits interconnect lengths and reduces delays.
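In the UCF that drives ISE, a Pblock is expressed as an area group: instances are assigned to a named group, and the group is given a placement range. The fragment below is a hedged sketch of what such constraints can look like. The instance names (receiver, receiver_fifo_ctrl), the group name pblock_receiver, and the slice coordinates are all hypothetical and must be replaced with values that fit the actual design and device.

```ucf
# Assign logic from anywhere in the logical hierarchy to one physical group.
# All names and coordinates here are illustrative placeholders.
INST "receiver" AREA_GROUP = "pblock_receiver";
INST "top_ctrl/receiver_fifo_ctrl" AREA_GROUP = "pblock_receiver";

# Confine the group to a rectangle of slices on the device.
AREA_GROUP "pblock_receiver" RANGE = SLICE_X0Y0:SLICE_X31Y63;
```

The second INST line shows the point made above: the physical grouping does not have to follow the logical hierarchy.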
10.1. PlanAhead Flow
The PlanAhead tool sits between synthesis and the ISE place and route tools. Any FPGA
synthesis tool targeting Xilinx FPGAs can be used for your design. The PlanAhead tool uses
the synthesized netlist and design constraint files for analysis. The tool allows you to export an
EDIF netlist and a single design constraint (UCF) file to drive the ISE tools.
Figure 4: FPGA Design flow using PlanAhead
A floor plan can be created in the following ways.
1. The netlist from Synplify Pro, along with the relevant user constraints file (UCF), can be
input directly to the PlanAhead tool to create a new floor plan.
2. If the design has already been run through ISE, the results can help in floor plan creation.
The ExploreAhead tool within PlanAhead can be used to load the existing placement into the
PlanAhead floor plan.
Figure 5 : Typical Placement view in PlanAhead
Floor Planning with PlanAhead: Floor planning is an iterative process. To begin with, create a
new Pblock containing the critical path logic (the logic that violates timing) and place it near
the logic it interacts with. The “show connectivity” feature in PlanAhead can be used to find an
appropriate place for the Pblock.
Figure 6 : Using the Show Connectivity command
Run TimeAhead after the placement. If the violations disappear with the new placement, save
the UCF constraints and run ExploreAhead.
Tip: PlanAhead has an embedded static timing analysis engine and environment called TimeAhead.
This provides a good and fast approximation of the timing for the placed design. The analysis can be
run with zero interconnect delays or with estimated delays.
If the above approach does not work and a lot of common logic is involved in the Pblock
creation, you may have to repartition larger Pblocks into smaller Pblocks.
Tip: As a general rule, the smaller the logic constrained in a Pblock, the more predictable it
becomes.
The figures below illustrate the re-partitioning of a Pblock.
Step-1: Select the top module of the design. Suppose it contains three sub-blocks, receiver, led
and channel, as shown in figure-6. Create the Pblocks pblock_receiver, pblock_led and
pblock_channel for them. Place them and run the TimeAhead tool.
Step-2: If TimeAhead reports any violation, as shown in figure-7, the Pblock containing the
critical path needs to be re-partitioned. In this case, repeat step-1 with pblock_receiver set as
the top (since it contains the critical path).
Tip: As a rule of thumb, the Pblocks with the heaviest bundle nets should be placed close together.
Figure 7: Initial Pblock Placement
Figure 8: Refined Pblock Placement
ExploreAhead: PlanAhead contains a tool called ExploreAhead, which allows multiple
implementation attempts using various ISE command options.
Users can create and save ISE “Strategies”, each of which is a set of option configurations for the
ISE implementation commands. These Strategies are then applied to floor plans for
implementation using ISE. Users can monitor progress, view log reports, and quickly identify and
import the best implementation results.
11. Conclusion
Although synthesis tools are becoming more and more advanced, a sound understanding of the
design and proper planning always pay off, especially in designs with large gate counts, multiple
clocks and higher speeds.