
Computers and Electrical Engineering 40 (2014) 1838–1857


Design and verification of an efficient WISHBONE-based network interface for network on chip

http://dx.doi.org/10.1016/j.compeleceng.2014.05.006
0045-7906/© 2014 Elsevier Ltd. All rights reserved.

Reviews processed and recommended for publication to Editor-in-Chief by Associate Editor Dr. Saraju Mohanty.
* Corresponding author. Tel.: +1 306 966 5456.

E-mail address: [email protected] (S.-B. Ko).

K. Swaminathan a,b, G. Lakshminarayanan a, Seok-Bum Ko b,*
a Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, India
b Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, Canada

Article info

Article history:
Received 11 November 2013
Received in revised form 13 May 2014
Accepted 14 May 2014
Available online 7 June 2014

Abstract

In this paper, a generic asynchronous First In First Out (FIFO) based, WISHBONE compatible, plug and play Network Interface (NI) for Network on Chip (NoC) is designed and verified. Four different types of encoded asynchronous FIFOs, namely binary, Gray, one-hot and Johnson, are designed and analyzed. It is found that the Gray-code asynchronous FIFO is the best at handling the asynchronous clock domain issues in the NI. The control signals of the WISHBONE bus wrappers from/to the asynchronous FIFOs and the packing/unpacking modules are asserted concurrently at the same rising edge of the respective router and IP clocks to reduce the latency. The same NI has been utilized for transferring data between synchronous as well as asynchronous clock domains irrespective of clock frequency and phase differences. The proposed NI ensures seamless data transfer between the routers and IP cores with lower latency, higher throughput, higher speed and less area than the existing design.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Increasing the number of reusable Intellectual Property (IP) cores and integrating more than a billion transistors in a Multi-Processor System on Chip (MPSoC) design in the nano-electronic integrated circuits era brings ever increasing design and testing challenges [1–3]. The interconnection delays in bus-based communication are increasing rapidly compared to gate delays. This results in performance degradation and synchronization problems between IPs in an MPSoC as the number of IP cores increases [2,3]. NoC architectures have been suggested as a promising solution for a highly scalable, reliable and modular on-chip communication infrastructure [2,3]. The NoC design represents a new paradigm for designing MPSoCs, which shifts the design methodology from computation-based to communication-based [3]. The NoC architecture uses layered protocols and packet-switched networks which consist of on-chip routers, links and Network Interfaces (NIs) on a predefined topology. The development of a complete application-specific NoC for an MPSoC is a challenging process that requires the definition of a suitable network topology, protocols and crossbar switches, and demands adequate design flows to minimize design time, effort and cost. Interfacing IP cores with different data widths and frequencies to the NoC is a critical task due to its asynchronous nature. Connecting different IP cores to the NoC router using an NI is complex because of the asynchronous clock domains, the different data widths, and the assembling and disassembling of packets. Therefore, it is essential to develop a plug and play generic NI that handles Clock Domain Crossing (CDC) issues in order to pass data at a high rate between different clock domains without loss [4].

In this paper, a new micro-architecture of NI for NoC has been implemented utilizing the OpenCores WISHBONE bus [5]; it enables short design time and offers seamless high throughput data flow. The key contributions of the proposed work are:

• The proposed NI works as a dual purpose NI which can interface synchronous as well as asynchronous IPs with routers irrespective of clock frequency and phase differences between the modules. The data width and FIFO size can be modified as per the application requirement.
• The micro-architectural level merging of the bus wrappers with the respective packing/unpacking modules and the asynchronous FIFO offers a latency free bus wrapper to achieve high speed data transactions on the NI between various processing IP cores and the NoC router.
• Different encoded asynchronous FIFO schemes such as binary [6], Gray [6], Johnson [7] and one-hot [8] are designed and analyzed. The proposed NI design utilizes the best of the four, namely the Gray encoded FIFO.
• A low latency packing and unpacking unit of the proposed NI offers efficient assembling by inserting packet fields such as routing information and payload details, and disassembling by extracting packet fields at a fast rate. The optimum latency of the entire NI is two clock cycles, which is limited by the packing, unpacking and asynchronous FIFO modules.
• The proposed NI is verified using a coverage driven, constrained random verification environment [9].

In the proposed design, the data transaction from the router end bus wrapper to an asynchronous FIFO (transmit FIFO), and from the IP end bus wrapper to an asynchronous FIFO (receive FIFO), is done without any latency due to the micro-architecture level merging mechanism. The read operation between the respective FIFOs and the bus wrappers takes place via the packing and unpacking modules with a latency of one clock cycle. This is achieved by concurrently sampling the data and control signals of the sub modules that belong to the same clock domain at the same edges of the respective clocks. The proposed generic NI has better efficiency, higher throughput and lower latency, and offers a simple and flexible connection mechanism: a single processing core can be connected to the router directly, and multiple processing cores with memories and peripherals can be connected to the routers through other standard commercial System on Chip (SoC) buses. The connection is established irrespective of the different frequencies and phases among them.

This paper is organized as follows. Section 2 presents an extensive literature survey of NI for NoC. Section 3 gives an overview and the essential requirements of NI for NoC, and explains the NoC packet format used in the proposed design and the metastability issues that arise when connecting subsystems in different clock domains. Section 4 describes the salient features of various asynchronous FIFOs using different encoding schemes. Section 5 describes the features of the WISHBONE bus with its read and write operations, and explains the implementation of the WISHBONE compatible asynchronous FIFO based NI architecture for NoC. Section 6 explains the constraint driven BFM based verification environment utilized to verify the proposed NI. Section 7 compares the performance of WISHBONE compatible NIs using different encoded asynchronous FIFOs. Finally, Section 8 concludes the paper.

2. Related work

Several studies have explored NI implementations for NoC in order to overcome the asynchronous problem and to standardize the NI so as to improve speed and throughput [4,10–19]. Generally, the proposed implementations are based on Direct Memory Access (DMA) with asynchronous FIFO [4,10], Globally Asynchronous Locally Synchronous (GALS) design [11–13], the Advanced Microcontroller Bus Architecture (AMBA) [14] and the AMBA Advanced eXtensible Interface (AXI) bus [15], the Open Core Protocol (OCP) [16–18] and asynchronous FIFO [19].

In [4], the authors proposed a simple, generic, programmable NI architecture which offers rapid plug and play interfacing of IPs to routers with minimal performance overhead. The packet maker (PM) and packet disassembler (PD) units of the NI handle the header parsing, payload correction and routing path determination. Apart from the asynchronous FIFO, this implementation utilizes extra memories, namely the PM memory and the PD memory. In [10], the authors proposed the Network Processor Array (NePA) platform, which utilizes DMA based generic master core and slave core NIs with buffered and un-buffered modes. When working in different clock domains, simultaneous read and write operations cannot take place in the same FIFO/buffer without sufficient delay; only one of the read or write operations can be done at a time. The extra memory usage and the complex controller design result in high latency and area overhead compared to the proposed design.

In [12], the authors proposed synchronous/asynchronous dual mode on-chip and off-chip interfaces that utilize a Gray encoded FIFO based GALS NoC architecture to resynchronize between the synchronous and asynchronous NoC. The off-chip/on-chip NoC interface uses a mixed synchronous/asynchronous dual mode NoC port composed of two distinct interfaces, asynchronous to synchronous (A-to-S) and synchronous to asynchronous (S-to-A). Each S-to-A and A-to-S virtual channel (VC0 and VC1) is made up of two Gray encoded FIFOs and utilizes a bundled-data handshake protocol, which results in an area overhead of two extra Gray encoded FIFOs. Later, the same authors proposed another asynchronous FIFO solution, claiming that the Gray code presents limitations: complex implementation, encoding of only powers of two, problems in pointer increment, and extra logic blocks to convert binary to Gray [13]. As the new solution, a Johnson encoded FIFO was suggested instead of the Gray encoded FIFO for an area efficient design. This GALS based, delay-insensitive, 4-phase protocol network adaptor achieves a higher throughput and provides Dynamic Voltage Frequency Scaling (DVFS) capabilities by using a local programmable clock generator; the GALS adapter is used as a hard macro for better timing control and easy top level integration. These architectures have the advantage of handling synchronous and asynchronous NoC packets at a higher data rate; however, there is no information about a packing and unpacking unit to handle NoC packet transactions in the NI. The design proposed in this paper offers higher speed than the above mentioned implementation.

In [14], a low latency AMBA based Master Network Interface (MNI) and Slave Network Interface (SNI) with 4-phase, 2-phase and credit based flow control mechanisms are proposed. Because this design relies on flow control mechanisms without asynchronous FIFOs, simultaneous read and write operations cannot be performed in the same memory, which leads to higher latency and lower throughput. In [15], the network architecture exploits the AXI transaction based protocol to be compatible with existing IP cores. This NI architecture provides a novel dynamic buffer based on variable packet size for improving the resource utilization and performance of the NI and NoC. Master and slave NI architectures with a Reorder-Packet Table (RPT) and a Reorder-Buffer (RB) have been implemented; they provide high resource efficiency with little hardware overhead and create enough space to store the incoming out-of-order packets. An OCP IP standard bus based NI is implemented in [16] with basic and precise/imprecise burst mode extensions to speed up the NI transactions with the NoC router and the integration process. A comparative study has been done between handshake based and credit based flow control on the master and slave network adaptors. This implementation utilizes separate request and response modules with handshake and credit based flow control, which results in higher latency compared to our proposed design.

In [17], a low latency NI using an OCP IP interface with a pausable clock is implemented in order to reduce the power dissipation of the NI, and a hibernate switching technique is used while no communication is available. This offers smooth communication between the OCP IP and the NoC routers by packing OCP transactions into NoC flits and converting NoC flits into OCP transactions, computing the routing information and buffering the flits of packets, which improves performance with a significant power reduction. In [18], the authors designed an OCP compatible NI which utilizes three different burst modes based on the nature of the burst data length, namely Precision Burst (PB), Imprecise Burst (IB) and Single Request Multiple Data burst (SRMD), with credit based and handshake flow control mechanisms. This design utilizes memory sharing techniques for area gain and two level gated clock techniques for area reduction. These OCP compatible NIs mainly use credit based and handshake protocols with MNI and SNI. Compared to the design proposed in this paper, the above implementations have higher latency and lower throughput due to the credit based and 4-phase handshake protocols.

In [19], the authors proposed a reliable NI based on a dual clock asynchronous FIFO, which synchronizes multiple packets from different sources to a single destination with a packing and unpacking unit capable of handling wormhole switching with X–Y routing. However, the design does not use any standard bus protocol to connect an entire bus based SoC IP with the NoC router.

In this paper, a generic dual purpose WISHBONE compatible NI is proposed to handle synchronous as well as asynchronous data transfer between IPs and the NoC router. The latency free bus wrappers and the one clock cycle latency of the packing and unpacking modules allow the proposed NI to support low latency, high throughput data transactions such as multimedia streaming and high speed peripherals. This is achieved by merging the bus wrapper with the packing/unpacking module and by merging the FIFO input signals into the respective bus wrapper module. To the authors’ knowledge, this is the first comprehensive attempt at implementing a WISHBONE compatible asynchronous FIFO based NI for NoC on the router side interface as well as on the IP side interface with configurable packing and unpacking modules. The proposed asynchronous NI can easily adapt to any range of frequencies irrespective of packet or flit size and clock phase differences.

3. Network interface requirements of NoC

3.1. Network interface

The NoC is composed of routers, NIs and links. The NoC router/switch transports data from incoming ports to outgoing ports to connect to the adjacent router or NI. The NI module connects the processing element to the NoC router. NIs convert request and response transactions into packets and vice versa [20]. Links provide physical connections between adjacent routers.

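Fig. 1. General block diagram of NI for NoC. [Figure: the Network Interface sits between the IP core port and the NoC router local port and contains a packing module with an asynchronous FIFO and an unpacking module with an asynchronous FIFO.]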


A physical link can support more than one logical link or channel. The success of the NoC design is based on the standardization of the interfaces between IP cores and the interconnection fabric. The general structure of the NI is shown in Fig. 1; it performs the following functions [20]:

• Writing/reading the flits to/from the processing core and the router, and vice versa.
• Packing the incoming signals from the IP cores by assembling the flits and inserting the header information and flit type with the exact number of flits, and unpacking the signals coming from the router as per the IP core specification.
• Transferring data from one clock domain to another clock domain without loss.

3.2. NoC packet format

A message is a contiguous group of bits that is delivered from the source terminal to the destination terminal. A message consists of packets. A packet is the basic unit for routing and sequencing. Packets may be divided into flits. A flit is the basic unit of bandwidth and storage allocation. Flits are divided into header flits, body flits and tail flits. The header flit carries the routing information, namely the source address, the destination address and the sequence information. Body and tail flits do not have any routing or sequence information and follow the route of the whole packet. A flit is in turn divided into phits (physical transfer digits). A phit is transferred across a channel in a single clock cycle. These resource allocation units are handled in different layers of the network protocol for different purposes. Flits and phits are handled in the physical layer of the NI, where they are used to synchronize data transfer. There are no specified standard sizes for the resource allocation units [1]. The packets in the proposed work consist of flits with a width of 40 bits. The header flit is located at the first position of the packet and contains information about the source and destination addresses, the size of the packet and any other application specific commands. The detailed description of the packet is shown in Fig. 2. These packets must traverse from the router clock domain to the processing clock domain, and vice versa, without any data loss.
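To illustrate how a 40-bit header flit might carry the fields listed above, the following Python sketch packs and unpacks a header. The field names and widths (2-bit flit type, 8-bit source, 8-bit destination, 8-bit packet size, 14-bit command) are assumptions chosen only so that they total 40 bits; the actual layout is the one defined in Fig. 2.

```python
FLIT_WIDTH = 40

# Hypothetical header layout, most significant field first. The widths are
# illustrative assumptions that sum to 40 bits, not the layout of Fig. 2.
HEADER_FIELDS = [
    ("flit_type", 2),
    ("src", 8),
    ("dst", 8),
    ("pkt_size", 8),
    ("command", 14),
]


def pack_header(**values):
    """Concatenate the header fields into one 40-bit integer, MSB first."""
    flit = 0
    for name, width in HEADER_FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        flit = (flit << width) | value
    return flit


def unpack_header(flit):
    """Split a 40-bit header flit back into its named fields."""
    fields = {}
    shift = FLIT_WIDTH
    for name, width in HEADER_FIELDS:
        shift -= width
        fields[name] = (flit >> shift) & ((1 << width) - 1)
    return fields
```

Packing followed by unpacking returns the original field values, which mirrors the requirement that a packet crosses the NI without data loss.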

3.3. Metastability

A fundamental problem in digital systems is synchronization in the absence of a global timing reference [21]. Metastability is a fundamental problem which causes system failure in digital devices when interfacing between circuitry in unrelated or asynchronous clock domains [22]; it is caused by registers not meeting the setup (Tsu) and hold (Th) time requirements at the active edge of the clock signal. Synchronization methods are used to reduce the probability of failure due to metastability. The synchronization failure probability can be reduced to an acceptable range by a carefully designed synchronizer [6,22]. The simplest and safest solution to avoid metastability problems when crossing asynchronous clock domains is to use flip-flop based synchronizers: double cascaded, triple cascaded or multi-stage flip-flop chains [6]. The asynchronous FIFOs of the proposed NI use the double synchronizer to avoid metastability.
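The effect of the synchronizer on the failure probability is commonly summarized by the standard mean time between failures (MTBF) expression below. It is given here as general background rather than as a result of this paper; the symbols $t_r$ (resolution time available before the next stage samples), $\tau$ and $T_0$ (device dependent constants), $f_{\mathrm{clk}}$ (receiving clock frequency) and $f_{\mathrm{data}}$ (toggle rate of the asynchronous input) are not defined in the original text.

```latex
\mathrm{MTBF} = \frac{e^{\,t_r/\tau}}{T_0 \, f_{\mathrm{clk}} \, f_{\mathrm{data}}}
```

Adding a second flip-flop increases $t_r$ by roughly one clock period, and since $t_r$ appears in the exponent this raises the MTBF sharply, which is why a double synchronizer is usually considered sufficient.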

Fig. 2. Resource allocation unit of proposed NI for NoC.


4. Asynchronous FIFO features

4.1. Asynchronous FIFO

An asynchronous FIFO is a FIFO in which the clock at the reading side of the buffer and the clock at the writing side of the buffer differ in speed and phase, that is, the clocks are asynchronous with reference to each other. Asynchronous FIFOs are used to transfer data from one clock domain to another without any loss of data [6]. This requires a memory architecture with two memory ports, one for the input (write or push) operation and another for the output (read or pop) operation. FIFO pointers are used to keep track of the read and write locations and also to prevent overflow and underflow of the FIFO buffer. FIFOs have the inherent characteristic of synchronizing themselves to the read and write pointers.

Different encoding schemes such as binary, Gray, Johnson and one-hot are used to encode the read and write pointers so that they can be passed from one clock domain to another while avoiding metastability.

1. Binary encoding: The characteristic of a binary counter is that the number of bits changing per transition is not constant, and in half of the transitions two or more bits change. In a 4-bit binary counter, when the transition from 0111 to 1000 happens, all 4 bits change, and this may result in 4 metastable conditions [6]. In this scenario it is impossible to predict the metastable condition. The pointer value synchronized into the other clock domain may become entirely different from the intended one. This is the biggest drawback of using binary counters as FIFO pointers [6].

2. Gray encoding: The Gray code counter is the safest counter that can be used in multi-clock designs. It allows only one bit to change for each clock transition, eliminating the problem associated with trying to synchronize multiple changing bits across a clock domain. One drawback of Gray codes is that they can only be designed for mod 2^n counters [6]. Another drawback is that arbitrary multi-bit values cannot be passed using Gray pointers; the pointer must be either incremented or decremented [6]. This encoding gives very low latency compared to the other encoding schemes [7].

3. One-hot encoding: This scheme uses a ring like structure. It requires N flip-flops to generate N states [8]. This allows the FIFO depth to be any number, without being restricted to a power of 2 as in Gray codes [6].

4. Johnson encoding: A Johnson counter for the read and write pointers is similar to Gray in that a single bit changes per transition, but it can represent any even number of states, whereas Gray can represent only 2^n states [7].
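The difference between the four schemes can be checked with a short Python sketch that generates each counting sequence and measures how many bits change per transition. The function names are illustrative and not taken from the paper; the Johnson counter is modelled as a twisted ring that shifts left and feeds back the inverted MSB.

```python
def binary_seq(n_states):
    """Plain binary count 0 .. n_states-1 as bit strings."""
    width = max(1, (n_states - 1).bit_length())
    return [format(i, f"0{width}b") for i in range(n_states)]


def gray_seq(n_bits):
    """Reflected Gray code with 2**n_bits states."""
    return [format(i ^ (i >> 1), f"0{n_bits}b") for i in range(1 << n_bits)]


def one_hot_seq(n_states):
    """Ring counter: exactly one bit set, N flip-flops for N states."""
    return [format(1 << i, f"0{n_states}b") for i in range(n_states)]


def johnson_seq(n_bits):
    """Twisted-ring counter: 2*n_bits states from n_bits flip-flops."""
    state, seq = 0, []
    mask = (1 << n_bits) - 1
    for _ in range(2 * n_bits):
        seq.append(format(state, f"0{n_bits}b"))
        msb = (state >> (n_bits - 1)) & 1
        state = ((state << 1) & mask) | (msb ^ 1)
    return seq


def max_bits_changed(seq):
    """Largest Hamming distance between consecutive codes, wrap-around included."""
    return max(
        sum(a != b for a, b in zip(seq[i], seq[(i + 1) % len(seq)]))
        for i in range(len(seq))
    )
```

For a 4-bit binary counter the worst case is four changing bits (0111 to 1000), Gray and Johnson sequences never change more than one bit, and a one-hot ring always changes two; `johnson_seq(5)` yields ten states, a depth that a Gray counter cannot represent.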

4.2. Generation of full and empty signals

The FIFO size may need to be optimized at a value that is not a power of two, which is difficult when using Gray and binary encoding. Gray coding uses extra logic, which costs more area and lowers performance. In a binary encoded FIFO, the toggling of multiple bits in the read and write pointers can result in wrong sampling during pointer comparison [6]. Johnson and one-hot encoding of the read and write pointers rectify those limitations and complexities of Gray and binary encoding of FIFO pointers [6–8]. Johnson and one-hot encoding are other codes with a small, constant Hamming distance between consecutive elements (one for Johnson, two for one-hot), which allows safe synchronization of the pointers with minimal combinational logic. The FIFO is full when the write pointer catches up with the read pointer after wrapping around, and empty when the read pointer catches up with the write pointer [6]. Pointers must be one bit larger than needed to address the FIFO memory. The wrap around condition is detected using the Most Significant Bit (MSB) of the binary, Gray, one-hot and Johnson pointers. The comparison unit checks the read and write pointers for equality, excluding the MSB. If the remaining bits of both pointers are equal, then the FIFO must be either empty or full, and the MSBs of the pointers are checked. If the MSBs are equal, then both pointers have wrapped the same number of times, so the FIFO is empty. If the MSBs are not equal, then the FIFO is full [6–8].
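The MSB based comparison described above can be sketched in Python as follows. This is a simplified single clock model with binary pointers and a power-of-two depth, both assumptions made for brevity; in the asynchronous FIFO the pointers are encoded and passed through the synchronizers before being compared. The class and method names are hypothetical.

```python
class PointerFifo:
    """FIFO whose pointers carry one extra MSB to tell full from empty."""

    def __init__(self, depth):
        if depth < 2 or depth & (depth - 1):
            raise ValueError("depth must be a power of two, at least 2")
        self.depth = depth
        self.addr_bits = depth.bit_length() - 1
        self.mem = [None] * depth
        self.wptr = 0  # addr_bits + 1 bits wide
        self.rptr = 0  # addr_bits + 1 bits wide

    def _msb(self, ptr):
        return ptr >> self.addr_bits

    def _addr(self, ptr):
        return ptr & (self.depth - 1)

    def _same_addr(self):
        return self._addr(self.wptr) == self._addr(self.rptr)

    def empty(self):
        # Equal address bits and equal MSBs: same number of wrap-arounds.
        return self._same_addr() and self._msb(self.wptr) == self._msb(self.rptr)

    def full(self):
        # Equal address bits but different MSBs: writer is one lap ahead.
        return self._same_addr() and self._msb(self.wptr) != self._msb(self.rptr)

    def _advance(self, ptr):
        return (ptr + 1) % (2 * self.depth)

    def write(self, data):
        if self.full():
            return False
        self.mem[self._addr(self.wptr)] = data
        self.wptr = self._advance(self.wptr)
        return True

    def read(self):
        if self.empty():
            return None
        data = self.mem[self._addr(self.rptr)]
        self.rptr = self._advance(self.rptr)
        return data
```

With a depth of four, the fifth consecutive write is rejected because the address bits match while the MSBs differ, and after all words have been read back both pointers are identical again and the FIFO reports empty.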

5. Working of WISHBONE compatible network interface using asynchronous FIFO

5.1. OpenCores WISHBONE bus

The detailed WISHBONE SoC interconnection architecture is shown in Fig. 3. WISHBONE offers a flexible design methodology to interconnect semiconductor IP cores. Its main purpose is to foster design reuse by alleviating SoC integration problems. This is accomplished by creating a common interface between IP cores, which improves the portability and reliability of the system and results in faster time-to-market for the end user [1]. Unlike the WISHBONE bus, commercially available SoC buses such as the Advanced Microcontroller Bus Architecture (AMBA 2.0), IBM CoreConnect, STMicroelectronics STBus, Sonics SMART Interconnect, Altera Avalon and OCP are not royalty free. The end user does not need to depend on particular implementation tools and methodologies; the WISHBONE specification is available in the public domain [5] and may be freely copied and distributed by any means.

The NI consists of two master/slave modules, a packing module, an unpacking module and two asynchronous FIFO modules, as shown in Fig. 3. Two types of transactions can take place in the NI, namely router to IP core and IP core to router.

Fig. 3. Detailed diagram of WISHBONE compatible NI for NoC.


5.2. WISHBONE master and slave

The master/slave interface module at the router side generates the signals required to write data into the FIFO and to read data out of the packing module. It translates the packing module signals into WISHBONE compatible signals and also generates the control signals necessary to write the data from the NoC router to the FIFO.

Similarly, the master/slave interface module at the processing core side generates the signals required to write data into the asynchronous FIFO and to read the unpacked data out of the unpacking module. It translates the unpacking module signals into WISHBONE compatible signals and generates the control signals necessary to write the data from the processing core to the asynchronous FIFO. This interface performs single/block read/write operations on both sides. The WISHBONE reset (rst_i) signal is active LOW and is used to reset the IPs and the bus.

5.2.1. Signal and notation description

i. ‘_i’ and ‘_o’ denote input and output signals.
ii. ‘rtr_’ denotes a signal related to the router.
iii. ‘rtr_s_wb’ denotes a signal related to the router with the WISHBONE slave.
iv. ‘rtr_m_wb’ denotes a signal related to the router with the WISHBONE master.
v. ‘ip_s_wb’ denotes a signal related to the IP core with the WISHBONE slave.
vi. ‘ip_m_wb’ denotes a signal related to the IP core with the WISHBONE master.
vii. ‘cyc’, the cycle signal, indicates when asserted that a valid bus cycle is in progress.
viii. ‘stb’, the strobe signal, indicates a valid data transfer cycle.
ix. ‘we’, the write enable signal, is LOW during READ cycles and HIGH during WRITE cycles.
x. ‘ack’ indicates the termination of a normal bus cycle by the slave device.
xi. ‘sel’, the select signal, indicates where valid data is expected on the ‘data’ signal array during READ cycles, and where it is placed on the ‘data’ signal array during WRITE cycles. The bus configuration is achieved by utilizing the ‘sel’ signal.

The single READ/WRITE cycles are the basic modes of data transfer on the WISHBONE bus. The master begins the READ cycle at the rising edge of the clock. At that time it places an address onto the address bus ‘adr_o’, drives the ‘we_o’ line LOW to indicate a READ cycle, drives the data select lines ‘sel_o’ HIGH or LOW depending upon the location that is being read, and asserts ‘stb_o’ and ‘cyc_o’ to indicate the start of a new bus cycle. The WRITE cycle is similar to the READ cycle. However, in this case the MASTER drives the ‘we_o’ signal HIGH and presents valid data on ‘dat_o’ at the beginning of the cycle. In response, the SLAVE asserts ‘ack_o’ when it is ready to latch the data at the next rising edge of the clock.
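The single READ and WRITE cycles described above can be summarized by the values the master drives at the start of each cycle. The Python sketch below is a behavioural illustration only: `MasterOutputs`, `start_read`, `start_write` and `RegisterSlave` are hypothetical names, the slave is a plain register file, and clock timing is abstracted to one call per bus cycle.

```python
from dataclasses import dataclass


@dataclass
class MasterOutputs:
    """Signals a WISHBONE master drives at the start of a bus cycle."""
    cyc_o: int
    stb_o: int
    we_o: int
    adr_o: int
    sel_o: int
    dat_o: int = 0  # only meaningful during WRITE cycles


def start_read(adr, sel):
    # READ: we_o LOW, address and select valid, stb_o and cyc_o asserted.
    return MasterOutputs(cyc_o=1, stb_o=1, we_o=0, adr_o=adr, sel_o=sel)


def start_write(adr, sel, data):
    # WRITE: as READ, but we_o HIGH and valid data presented on dat_o.
    return MasterOutputs(cyc_o=1, stb_o=1, we_o=1, adr_o=adr, sel_o=sel, dat_o=data)


class RegisterSlave:
    """Illustrative slave: acknowledges every valid cycle immediately."""

    def __init__(self, size=16):
        self.regs = [0] * size

    def respond(self, m):
        """Return (ack_o, dat_o) for the signals presented by the master."""
        if not (m.cyc_o and m.stb_o):
            return 0, None  # no valid cycle in progress
        if m.we_o:
            self.regs[m.adr_o] = m.dat_o
            return 1, None
        return 1, self.regs[m.adr_o]
```

A write followed by a read of the same address returns the written word with `ack_o` asserted in both cycles, while a master that leaves `cyc_o` and `stb_o` LOW receives no acknowledge.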

5.2.2. WISHBONE slave data write to asynchronous FIFO

During the reset state, the ‘ack_o’ signal is driven LOW with respect to the router clock. A write operation is performed when ‘we_i’ is HIGH and the FIFO is not FULL. This asserts ‘ack_o’ HIGH, which indicates that a write transfer from the external master to the internal slave is permissible, and the data bus (dat_i), which is directly connected to the FIFO, writes the data. The write operation is not allowed when ‘we_i’ and the FIFO’s FULL flag are both HIGH, since writing data is impossible when the FIFO is full. The ‘dat_o’ and ‘add_i’ signals are not used during this write operation. The write operation on the asynchronous FIFO by the WISHBONE slave is identical on the NoC router side and on the processing core side. The control signals of the asynchronous FIFO and the WISHBONE bus wrappers are merged together at the micro-architectural level and sampled concurrently at a single clock edge to eliminate bus wrapper latency.
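A minimal behavioural sketch of this slave write path is given below, assuming a cycle level model in which one call to `rising_edge` stands for one rising clock edge. The class name and the use of a Python `deque` in place of the asynchronous FIFO are assumptions made for illustration.

```python
from collections import deque


class WbSlaveToFifo:
    """WISHBONE slave whose dat_i input feeds the FIFO write port directly."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()
        self.ack_o = 0  # LOW in the reset state

    @property
    def full(self):
        return len(self.fifo) >= self.depth

    def rising_edge(self, we_i, dat_i):
        """Sample the bus and the FIFO flag on the same clock edge."""
        write_ok = bool(we_i) and not self.full
        if write_ok:
            # Data is written in the same edge that raises ack_o, so the
            # wrapper itself adds no extra cycle.
            self.fifo.append(dat_i)
        self.ack_o = int(write_ok)
        return self.ack_o
```

Once the FIFO is full, further write attempts leave `ack_o` LOW and the stored data unchanged, which matches the rule that writing is impossible while the FULL flag is HIGH.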

5.2.3. WISHBONE master data read from the packing module

During the reset state, all the output signals ‘cyc_o’, ‘dat_o’, ‘stb_o’ and ‘we_o’ of the WISHBONE master are driven LOW at the rising edge of the router clock. When ‘ack_i’ is HIGH and ‘fifo_empty’ is LOW, ‘fifo_read’ is asserted HIGH to read data from the packing module. During this time the output signals ‘cyc_o’, ‘stb_o’ and ‘we_o’ are asserted HIGH and valid data is driven on ‘dat_o’. The ‘cyc_o’, ‘dat_o’, ‘stb_o’ and ‘we_o’ signals are driven LOW when ‘fifo_empty’ is HIGH. The ‘flit_type’ field contains the information about the type of flit and is connected to ‘sel_o’ of the WISHBONE master. The functions of the control signals are the same as described for the write operation. Thus, the latency of the NI is limited by the packing/unpacking modules and the asynchronous transmit/receive FIFOs.

5.3. Packing and unpacking module

The packing and unpacking modules perform the header parsing, assembling and disassembling of the incoming and outgoing data as per the router and processing core frequency, packet size and data width.

5.3.1. Packing module

The packing unit collects data from the asynchronous FIFO together with the necessary information of an individual packet [4], such as the source address, destination address, flit size and other fields as per the packet format shown in Fig. 2; it then transfers the packet to the WISHBONE master. The tasks of the packing module are:

• Forming the packet header with the routing information required by the routers and the destination IP core.
• Inserting the flit type information, that is, whether the flit is a header, body or tail.
• Assigning the exact sequence number to each flit and the packet size to the packet size field.
• Shaping the incoming data from the asynchronous FIFO into the packet as per the flit size in the next stages.

The Finite State Machine (FSM) of the packing module is shown in Fig. 4. The FSM has an ‘idle’ state that brings all the output signals to a known state. The transition out of every state depends on the ‘fifo_empty’ signal and the ‘ack_i’ signal from the WISHBONE master interface. When ‘ack_i’ is HIGH, the ‘header insertion’ state is initiated. During this state the header of the NoC packet is formed by inserting the flit type, packet size, routing information (i.e. source and destination address) and sequence number; all fields are assigned as per the flit format of the NoC. The packet size, type and count are updated in the header insertion state. The proposed design offers the flexibility to configure all the fields of the packet. When the FIFO is not empty, the read operation is initiated by asserting the ‘read_out_p’ signal HIGH. From here on, the state transitions are mainly based on ‘ps_count’ (flit_count), ‘fifo_empty’ and ‘ack_i’. When ‘ps_count > 1’ and the FIFO is not empty, the packing module reads the data from the FIFO and sends it to the WISHBONE bus during the ‘body_flit’ state. The read operation enters the ‘wait_state’ when ‘fifo_empty’ is HIGH and ‘ack_i’ is LOW. The read operation is performed as a single or block read based on the packet size in the header. On completion of the single or block read, when ps_count = total flit size − 1, the data flow reaches the ‘tail_flit’ state. This cycle repeats for every incoming packet. The data width of the flit is configurable as per the router and IP core requirement.
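The state sequence described above can be sketched as a small Python model. This is an interpretation of the text, not a transcription of Fig. 4: the class name `Packer`, the tuple output format and the choice to stall whenever the FIFO is empty or ‘ack_i’ is LOW are assumptions, and `total_flits` counts the header and tail flits and must be at least two.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    HEADER = auto()
    BODY = auto()
    WAIT = auto()
    TAIL = auto()


class Packer:
    """Cycle level sketch of the packing FSM (one step per clock edge)."""

    def __init__(self, total_flits, header):
        self.total_flits = total_flits  # flits per packet, header included
        self.header = header
        self.state = State.IDLE
        self.ps_count = 0  # flits emitted so far in the current packet

    def _payload_state(self):
        last = self.ps_count == self.total_flits - 1
        return State.TAIL if last else State.BODY

    def step(self, ack_i, fifo):
        """Advance one clock; return the (kind, data) flit emitted, or None."""
        if self.state is State.IDLE:
            if ack_i:
                self.state = State.HEADER
            return None
        if self.state is State.HEADER:
            self.ps_count = 1
            self.state = self._payload_state()
            return ("header", self.header)
        # BODY, TAIL or WAIT: a payload flit is due next.
        if not fifo or not ack_i:
            self.state = State.WAIT  # assumption: stall on either condition
            return None
        kind = "tail" if self.ps_count == self.total_flits - 1 else "body"
        data = fifo.popleft()
        self.ps_count += 1
        if kind == "tail":
            self.state, self.ps_count = State.IDLE, 0
        else:
            self.state = self._payload_state()
        return (kind, data)
```

Driving the model with a four flit packet and one cycle in which ‘ack_i’ is LOW produces a header, a body flit, a stalled cycle, a second body flit and finally the tail, after which the FSM returns to idle for the next packet.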

5.3.2. Unpacking module

The unpacking unit receives data from the asynchronous FIFO and extracts the header and data information. Based on the packet size and control information, the data transfer is carried out by the respective modules in the local IP core through the IP port via the WISHBONE master interface.

The FSM of the unpacking module is shown in Fig. 5. The flit transaction depends on the ‘fifo_empty’ signal from the asynchronous FIFO, the ‘ack_i’ signal from the WISHBONE master interface and the flit type value of each flit. When ‘ack_i’ is HIGH and the FIFO is not empty, the ‘header extraction’ state is initiated. The header data are extracted and the respective flit counter value is updated. The local address generator generates the address for each read from the FIFO, and the address is transferred to the address field of the WISHBONE master to track the word count. When the flit count is greater than one and flit_type = 2’b01, the data flow reaches the ‘Body_flit’ state and reads all the data from the FIFO. When flit_type = 2’b11, the state changes to ‘Tail_flit’ to read the last flit of the packet. The data transfer to the WISHBONE bus is either a single body flit or a block of body flits, based on the packet size. The wait state is entered during the ‘body_flit’ state when the FIFO is empty and ‘ack_i’ is LOW. The flit type information is also transferred to the IP core through the ‘sel_o’ signal of the WISHBONE master. When the state transition reaches the tail state, the FSM goes back to the idle or header extraction state depending on the ‘fifo_empty’ and ‘ack_i’ signals.
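The flit type driven behaviour of the unpacking unit can be illustrated with the following Python sketch. The body (2’b01) and tail (2’b11) codes come from the text; the header code 2’b00, the function name `unpack` and the representation of a flit as a `(flit_type, payload)` pair are assumptions. Handshaking with ‘ack_i’ and ‘fifo_empty’ is omitted so that only the header extraction, address generation and tail detection are shown.

```python
HEADER = 0b00  # assumed code; the text does not state the header flit type
BODY = 0b01
TAIL = 0b11


def unpack(flits):
    """Group a stream of (flit_type, payload) pairs into packets.

    Each packet is returned as a dict holding the header payload and a list
    of (address, word) pairs, where the address is the running word count
    produced by a local address generator.
    """
    packets = []
    current = None
    for flit_type, payload in flits:
        if flit_type == HEADER:
            current = {"header": payload, "words": []}
            continue
        if flit_type not in (BODY, TAIL):
            raise ValueError(f"unknown flit type {flit_type:#04b}")
        if current is None:
            raise ValueError("payload flit received before a header flit")
        address = len(current["words"])
        current["words"].append((address, payload))
        if flit_type == TAIL:
            packets.append(current)  # tail flit closes the packet
            current = None
    return packets
```

A stream containing a header, two body flits and a tail yields one packet with three addressed words; a packet may also consist of a header followed directly by a tail.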

Fig. 4. FSM diagram of packing unit.


6. Verification environment for NI using Verilog

6.1. Functional verification

Functional Verification (FV), together with a verification methodology, plays a significant role in verifying an IP module for a reliable RTL design, given the growth in complexity of ASIC designs. FV is the process of checking whether a design satisfies its expected functional specification. There are two earlier methods of verifying the correct function of a hardware bus interface: creating a test bench, and creating a larger system with other known-to-work components that create or respond to bus transactions. Creating a test bench for the different transactions of a bus-based interface is a large and time-consuming task; it involves describing the connections and test vectors for all the different combinations of bus transactions. Creating a system with another register-based interface component involves describing the connections of the Design Under Test (DUT) and programming the other component to generate the various bus transactions, performing inward and outward transactions based on the DUT response. Such a system usually involves creating and compiling code, storing the code in memory for the components to read, and generating the correct bus transactions. Bus Functional Simulation (BFS) simplifies the verification of hardware components that attach to a bus by providing the ability to generate bus stimulus without going through the previously described approaches [23].

Fig. 5. FSM diagram of unpacking unit.


6.2. Bus functional model

A Bus Functional Model (BFM) is a behavioral, non-synthesizable model of an IP core or integrated circuit component having one or more external buses/processors. BFMs mimic the hardware device functions, such as the state machine that executes the bus operations, timing information, interrupt cycles and specific bus- or processor-oriented functions, in order to simulate system bus transactions before the actual hardware modules are implemented [23,24]. BFMs are usually written as tasks/functions using Hardware Description Languages (HDLs) or software languages such as C, C++, SystemC, SystemVerilog, Synopsys OpenVera, Property Specification Language (PSL) and Cadence Specman ‘e’. The BFM is instantiated within the test bench, where it acts as a driver that drives signals into the DUT according to the bus protocol and passes sampled response signals to a monitor in the verification environment. Verification engineers need to know only the addresses of the bus/processor registers and the bus operation. Knowledge of the target device architecture, instructions, registers and ports is not required when using BFM components. The use of a BFM allows control over bus transactions and transaction spacing, and makes it possible to simulate abnormal transactions such as aborts, retries and errors [23]. BFM components of a bus interface can generate stimulus or respond to bus transactions. A master BFM generates bus transactions according to the master bus protocol, and the DUT is connected to it as a slave that responds to the master signals. A slave BFM responds to bus transactions that the master DUT generates. The monitor BFM reports any errors regarding the bus compliance of the DUT in master mode and slave mode respectively, as shown in Fig. 6. The verification is done for the individual and top modules of the DUT using the WISHBONE BFMs.

Some important formal verification processes to ensure the functional verification of the design are listed as follows [24].

[Fig. 6 block labels: Test bench; Stimulus generation and driving; Response monitoring and checking; Test bench wrapper; Design Under Test (DUT) with WISHBONE master and WISHBONE slave ports; WISHBONE master BFM; WISHBONE slave BFM; Test code (global packages, tasks and functions).]

Fig. 6. Verification environment of NI for NoC.


i. Identification: Identify whether FV is suitable and determine the nature of each subsystem to be verified (sequential, concurrent, control or data path block) using directed and random test cases. The clocking of the subsystems is also identified, and the subsystems are categorized into those operating in synchronous mode and those operating in asynchronous mode.

ii. Formal test planning process: A complete test plan needs to be created stating what is to be verified and how it is to be verified. The formal properties need to be defined in terms of generic behavior, independent of particular input scenarios, as minimal correctness criteria. The formal test plan should contain the verification requirements and the possible signal transitions expressed as constraints, and may use formal coverage targets. The language requirement for each subsystem should be planned early. The verification strategies need to be defined, and the order of the sub-blocks listed, to ease the regression process. A concise hierarchy must be maintained among the subsystems in the top module to avoid dependency issues between the subsystems.

iii. Define interface: The interface signals of each sub-module, the internal signals and the signals of interest to be monitored are listed in a table to determine the completeness of the requirement checklist during the review process. The asynchronous and synchronous interfaces of all subsystems must be listed. The verification environment shown in Fig. 6, which is used to verify the WISHBONE-based asynchronous NI, consists of a test bench top module having the following three parts: 1. Master and slave BFMs. 2. Testbench top module having a group of test cases. 3.

6.2.1. WISHBONE master BFM

The master bus function initiates data transactions, as per the master bus protocol, to the device connected in slave mode. Reset bus, single read, single write, block read and block write tasks are the important operations in the WISHBONE bus functional model. In the proposed verification plan the master BFM performs only write operations.

6.2.2. WISHBONE slave BFM

The WISHBONE slave BFM performs single and block read operations and responds to the data transactions of the master with the specified data size and range. In the proposed verification environment the slave BFM performs only read operations. The reset task and the delay insertion task are common to the WISHBONE master and slave BFMs.
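To make the idea of BFM tasks concrete, the following Python sketch models the reset, single and block read/write tasks at transaction level. It is a simplified stand-in for the Verilog tasks described above, not the authors' test code: the class names, the word-addressed dictionary memory and the always-acknowledging slave are assumptions, and signal timing is not modelled.

```python
class WishboneSlaveModel:
    """Toy slave: acknowledges every valid cycle and stores writes in a dict."""

    def __init__(self):
        self.mem = {}

    def cycle(self, cyc, stb, we, adr, dat=None):
        """Evaluate one bus cycle and return (ack, read_data)."""
        if not (cyc and stb):
            return False, None            # no valid bus cycle, so no ACK
        if we:
            self.mem[adr] = dat
            return True, None
        return True, self.mem.get(adr, 0)


class WishboneMasterBFM:
    """Transaction-level master tasks layered on the slave model."""

    def __init__(self, slave):
        self.slave = slave

    def reset_bus(self):
        # Drive CYC and STB low; the slave must not acknowledge.
        return self.slave.cycle(cyc=False, stb=False, we=False, adr=0)

    def single_write(self, adr, dat):
        ack, _ = self.slave.cycle(True, True, True, adr, dat)
        if not ack:
            raise RuntimeError("write was not acknowledged")

    def single_read(self, adr):
        ack, dat = self.slave.cycle(True, True, False, adr)
        if not ack:
            raise RuntimeError("read was not acknowledged")
        return dat

    def block_write(self, base, words):
        for offset, word in enumerate(words):
            self.single_write(base + offset, word)

    def block_read(self, base, count):
        return [self.single_read(base + offset) for offset in range(count)]
```

A test case then reduces to a few task calls, for example a block write followed by a block read of the same range and a comparison of the two lists.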

6.2.3. Verifying individual modules

• Reset conditions of all subsystems are verified.
• The asynchronous FIFO plays a major role in the proposed NI. The full and empty conditions of the individual FIFOs are verified by simultaneous read and write operations over various ranges of clock speed and different data burst sizes.
• The individual WISHBONE master and slave interfaces are verified by performing single read/write and multiple read/write operations and monitoring the control signal status against the expected response.
• The packing module is verified with possible combinations of header flits, data flits and tail flits.
• The responses of the unpacking module are checked against the header, payload and tail during the header extraction stage.
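The full and empty checks listed above can be prototyped against a simple reference model. The Python sketch below uses the common scheme of pointers that carry one extra wrap bit; this scheme, the class name `FifoModel` and the single-threaded read/write methods are assumptions for illustration, and the clock-domain crossing of the real asynchronous FIFO is not modelled.

```python
class FifoModel:
    """Behavioural FIFO whose pointers count modulo 2 * depth (one wrap bit)."""

    def __init__(self, depth):
        self.depth = depth
        self.mem = [None] * depth
        self.wr_ptr = 0
        self.rd_ptr = 0

    @property
    def empty(self):
        # Pointers equal, including the wrap bit.
        return self.wr_ptr == self.rd_ptr

    @property
    def full(self):
        # Pointers exactly one full lap apart.
        return (self.wr_ptr - self.rd_ptr) % (2 * self.depth) == self.depth

    def write(self, data):
        """Store one word; return False if the FIFO is full."""
        if self.full:
            return False
        self.mem[self.wr_ptr % self.depth] = data
        self.wr_ptr = (self.wr_ptr + 1) % (2 * self.depth)
        return True

    def read(self):
        """Return the oldest word, or None if the FIFO is empty."""
        if self.empty:
            return None
        data = self.mem[self.rd_ptr % self.depth]
        self.rd_ptr = (self.rd_ptr + 1) % (2 * self.depth)
        return data
```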

6.2.4. Verifying top level module

The following scenarios are verified in the overall integrated verification environment.

• The data, status and control signals are verified during the reset state of the entire NI.
• Single read and block read operations are performed with and without the acknowledgment signal, and the responses are checked.
• Single write and block write operations are performed with and without the write enable signal, and the responses are checked.
• Write operations are performed with different data widths (8, 16, 32 and 64 bits); the same data are read back and checked against the expected values.
• Write operations are performed with different block sizes (8, 16, 32, 64, 128, 256 and 512 word blocks); the same data are read back and checked against the expected values.
• All the test cases are exercised with the read frequency greater than the write frequency and vice versa.
• All the above operations are performed with different network interface and router frequencies.

6.2.5. Design under test

The complete RTL description or gate-level netlist of a system under verification is called the design under test. The DUT receives stimulus from the generator via the driver, and the response checking module checks the output of the DUT for correct operation as per the design specification. The stimulus driving the DUT can be generated in many different formats and from different sources under different scenarios, but it is primarily aimed at exercising the DUT output against a known response.

Fig. 7. Simulated wave form of all types of asynchronous FIFO.


7. Results and comparisons with existing work

7.1. Simulation results

The simulated waveforms for the read clock, write clock, input data, output data, internal read and write pointers, and empty and full signals of all four types of asynchronous FIFO, obtained using ModelSim, are shown in Fig. 7. The read and write clock signals of the differently encoded asynchronous FIFOs are connected to a single read clock and a single write clock in the top-level module of the testbench.

The speed ratio is depicted as 2:3 in Fig. 7. A range of different read and write clocks was also applied and the output responses were checked. The individual reset signals of the read and write clocks of all FIFOs are not shown in Fig. 7 for simplicity; however, the individual FIFO clocks are connected to ‘read_clk’ and ‘write_clk’.

The data input signal of the Gray encoded FIFO is driven with the series 1, 2, 3, 4, 5, ...; the one-hot encoded FIFO input signal is driven with 2, 4, 6, 8, 12, ...; the Johnson encoded FIFO input signal is driven with 3, 6, 9, 12, 16, ...; and the binary encoded FIFO input signal is driven with 4, 8, 12, 16, 20, .... The expected data of the different FIFOs are available at the respective output data signals.

7.2. Synthesis results

The WISHBONE compatible NI is synthesized using Synopsys Design Compiler targeted to the STMicroelectronics CORE90GPSVT 90 nm CMOS standard cell library at the nominal corner of 1.0 V and 25 °C. The asynchronous FIFO size/depth has the major impact on the performance of the NI. The speed, area and power of the different types of FIFO have been analyzed for different FIFO sizes. The results are shown in Table 1 and Figs. 8, 9, 11, 13 and 15. The performance of the NI using the Gray encoded FIFO is shown in Table 2 and Figs. 10, 14 and 16.

7.2.1. Speed

Performing static timing analysis (STA) across asynchronous clock domains is crucial for estimating the exact speed of the system because of the frequency and phase relationship between the domains. The following two important constraints must be set to achieve correct timing of designs based on asynchronous FIFOs [25].

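Table 1
FIFO depth vs. area, power and clock speed.

FIFO type | FIFO depth | Read clock speed of FIFO (MHz) | Write clock speed of FIFO (MHz) | Total area (µm²) | Static power (µW) | Dynamic power (mW) | Total power (mW)
Gray encoded FIFO | 4 | 2631 | 2564 | 3626.47 | 4.327 | 4.9705 | 4.974827
Gray encoded FIFO | 8 | 2439 | 2500 | 7183.792 | 8.313 | 8.7135 | 8.721813
Gray encoded FIFO | 16 | 2439 | 2380 | 14748.45 | 18.43 | 15.4881 | 15.50653
Gray encoded FIFO | 32 | 2439 | 2222 | 28737.36 | 35.334 | 27.4771 | 27.51243
Gray encoded FIFO | 64 | 2325 | 2173 | 56074.19 | 65.558 | 52.9033 | 52.96886
Gray encoded FIFO | 128 | 2222 | 2000 | 110545.9 | 126.709 | 91.8059 | 91.93261
Gray encoded FIFO | 256 | 2173 | 1960 | 219445.4 | 249.6877 | 174.6189 | 174.6189
Binary encoded FIFO | 4 | 2564 | 2564 | 3820.746 | 4.912 | 5.2729 | 5.277812
Binary encoded FIFO | 8 | 2040 | 2040 | 7270.502 | 8.296 | 7.305 | 7.313296
Binary encoded FIFO | 16 | 2040 | 2040 | 14376.36 | 16.998 | 13.8411 | 13.8581
Binary encoded FIFO | 32 | 2040 | 2040 | 28016.24 | 31.811 | 25.5333 | 25.56511
Binary encoded FIFO | 64 | 2040 | 2040 | 55787.71 | 64.6983 | 50.4765 | 50.4765
Binary encoded FIFO | 128 | 2040 | 2000 | 110075 | 123.726 | 96.6541 | 96.77783
Binary encoded FIFO | 256 | 2040 | 1923 | 218229.2 | 243.633 | 194.8761 | 195.1197
Johnson encoded FIFO | 4 | 2325 | 2272 | 3763.67 | 4.688 | 4.5286 | 4.533288
Johnson encoded FIFO | 8 | 2325 | 2222 | 7320.992 | 8.653 | 7.9897 | 7.998353
Johnson encoded FIFO | 16 | 2222 | 2222 | 14835.16 | 17.78 | 15.1618 | 15.17958
Johnson encoded FIFO | 32 | 2222 | 2173 | 29056.76 | 33.715 | 28.8107 | 28.84442
Johnson encoded FIFO | 64 | 2222 | 2083 | 58084.99 | 68.749 | 56.7653 | 56.83405
Johnson encoded FIFO | 128 | 2222 | 1923 | 108399 | 113.694 | 142.431 | 142.5447
Johnson encoded FIFO | 256 | 2222 | 1754 | 231719.8 | 277.301 | 223.9204 | 224.1977
One-hot encoded FIFO | 4 | 2380 | 2325 | 5481.414 | 4.635 | 4.7079 | 4.712535
One-hot encoded FIFO | 8 | 2325 | 2272 | 7731.494 | 9.566 | 8.4814 | 8.490966
One-hot encoded FIFO | 16 | 2325 | 2222 | 15723.12 | 19.628 | 16.0092 | 16.02883
One-hot encoded FIFO | 32 | 2325 | 2040 | 30718.53 | 36.94 | 30.6796 | 30.71654
One-hot encoded FIFO | 64 | 2325 | 2000 | 58857.7 | 72.232 | 57.4482 | 57.52043
One-hot encoded FIFO | 128 | 2325 | 1818 | 112612.7 | 130.321 | 110.0355 | 110.1658
One-hot encoded FIFO | 256 | 2325 | 1694 | 222752.4 | 259.474 | 215.8702 | 216.1297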

Fig. 8. Read clock speed vs. FIFO depth.

Fig. 9. Write clock speed vs. FIFO depth.

Fig. 10. Read, write speed of router and IP vs. FIFO depth of NI.


• Identifying the false paths and identifying the asynchronous clock groups.
• A false path is a logic path in the design that should not be analyzed for timing. The following paths should be set as false paths to avoid timing failures in the synchronization registers of the asynchronous FIFO.
• The paths crossing from the write clock domain into the read clock domain, between the synchronized delayed write pointer registers and the read pointer registers of the asynchronous FIFO.

Fig. 12. Dynamic power dissipation vs. FIFO depth.

Fig. 11. Static power dissipation vs. FIFO depth.

Fig. 13. Total power dissipation vs. FIFO depth.


• The paths crossing from the read clock domain into the write clock domain, between the synchronized delayed read pointer registers and the write pointer registers of the asynchronous FIFO.

The unrelated clocks must be declared as an asynchronous clock group in the constraints. In an asynchronous FIFO, the read domain and write domain clocks form an ‘asynchronous group’ for STA. Similarly, in the integrated NI, the router clock domain and the IP clock domain are placed in an asynchronous clock group. The false path constraint must also be set on the unrelated paths of the NI for STA.

A comparative analysis of the four types of encoded asynchronous FIFOs is shown in Table 1. The read and write clock speeds decrease gradually as the depth increases for all types of FIFO, as shown in Table 1 and Figs. 8–10. An increase in the depth of the FIFO increases the bit width of the binary counter as well as of the Gray, one-hot and Johnson encoders for the read and write pointers. Gray encoding has an edge over the other FIFO encoding techniques in write and read speeds because it uses the lowest number of bits and has low switching activity between two consecutive address jumps. For all types of FIFO, the read clock speed shown in Fig. 8 is generally higher than the write clock speed shown in Fig. 9 because the read logic is less complex. When it comes to write speeds, in

Fig. 15. Area of FIFOs vs. FIFO depth.

Fig. 14. Total power dissipation of NI vs. FIFO depth.

Table 2
FIFO depth of network interface vs. area, power and clock speed of network interface.

Network interface using Gray encoded FIFO

FIFO depth | IP clock speed (MHz) | Router clock speed (MHz) | Total area (µm²) | Static power (µW) | Dynamic power (mW) | Total power (mW)
4 | 2083 | 2040 | 22197.86 | 0.034078 | 23.3761 | 23.41018
8 | 2000 | 2000 | 32526.28 | 0.050726 | 32.7957 | 32.7957
16 | 1754 | 1851 | 52939.44 | 0.087785 | 51.1381 | 51.22588
32 | 1639 | 1724 | 92033.76 | 0.145706 | 86.7315 | 86.87721
64 | 1492 | 1515 | 167396.1 | 0.265348 | 156.532 | 156.7973
128 | 1298 | 1369 | 320963.5 | 0.505221 | 295.5932 | 296.0984
256 | 1219 | 1282 | 617009.2 | 0.093086 | 572.5944 | 573.5253


one-hot encoding, the switching activity is always constant, but the number of encoding bits increases with the FIFO depth. Because Johnson encoding uses half as many bits as one-hot encoding, with varied switching activity, it performs well compared with one-hot encoding at higher FIFO depths. Binary encoding has the advantage of using fewer bits, but with higher switching activity. At lower FIFO depths binary encoding may suffer, but as the FIFO depth increases it gains from using fewer bits. These situations are depicted in Figs. 8 and 9, which are based on the values in Table 1.
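The trade-off between pointer width and switching activity discussed above can be checked numerically. The Python sketch below uses the standard definitions of the four codes (a Johnson counter with n flip-flops has 2n states, a one-hot code uses one bit per state); the function names are hypothetical, and the extra wrap bit that FIFO pointers often carry is deliberately left out.

```python
def binary_to_gray(value):
    """Convert a binary count to its reflected Gray code."""
    return value ^ (value >> 1)


def toggles(a, b):
    """Number of bit positions that differ between two code words."""
    return bin(a ^ b).count("1")


def pointer_bits(depth):
    """Address bits per encoding for a power-of-two FIFO depth."""
    if depth < 2 or depth & (depth - 1):
        raise ValueError("depth must be a power of two and at least 2")
    n = depth.bit_length() - 1       # log2(depth)
    return {"binary": n, "gray": n, "johnson": depth // 2, "one_hot": depth}


def max_toggles_per_increment(depth, encode):
    """Worst-case bit flips when the pointer steps through all addresses."""
    return max(
        toggles(encode(i), encode((i + 1) % depth)) for i in range(depth)
    )
```

For a depth of 8, `pointer_bits` gives 3 bits for binary and Gray, 4 for Johnson and 8 for one-hot, while the worst-case toggles per increment are 1 for Gray and 3 for plain binary.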

The speed of the whole NI with respect to the FIFO depth is shown in Fig. 10 and Table 2. Figs. 8–16 are drawn from the values in Tables 1 and 2. The variation in the graphs depends on the read and write clocks with respect to the depth of the FIFO. When the FIFO depth increases, the FIFO memory occupies more area, consumes more power and reduces the speed (both read and write clock speed) of the NI. The difference in switching activity between the encoding schemes and the small variations between the write and read clocks do not require a large FIFO depth. When the FIFO depth is 4 or 8, the read and write pointers traverse almost all addresses, causing more switching activity, which results in the frequency fluctuations. Even when the FIFO depth is larger (say 32–256), the read and write pointers traverse only as few addresses as in the low-depth case because of the small frequency difference

Fig. 16. Area of NI vs. FIFO depth.


between the read and write clocks. This keeps the read clock speed almost constant for many FIFOs, as shown in Fig. 8. The write speed of the binary encoded FIFO is very high at low FIFO depth because the binary counter uses few bits. The other encoding schemes require a conversion (i.e. binary to Gray, binary to Johnson and binary to one-hot) in the read and write pointers, which results in a gradual decrease in write speed, whereas the write speed of the binary encoding scheme remains nearly constant irrespective of the depth, as shown in Fig. 9. The fluctuations depend on the switching activity of the encoding schemes and the depth of the FIFO. The IP clock speed is a bit lower than the router clock speed because the packing unit at the IP core side involves more complex logic than the unpacking unit at the router side, as shown in Fig. 10 and Table 2.

7.2.2. Power

The total power dissipation in any VLSI system is the sum of the switching power, short circuit power and leakage power. The short circuit and leakage power are referred to here as the static power and the switching power as the dynamic power. At a FIFO depth of 256, the static power of the binary encoded FIFO is lower by about 6 µW, 34 µW and 16 µW than that of the Gray, Johnson and one-hot encoded FIFOs respectively (Table 1). The static power dissipation is low because the binary read and write pointers are used without any encoding scheme, which reduces the number of gates compared with the other FIFOs, as shown in Fig. 11. Dynamic power dissipation refers to the power consumed by a CMOS gate as a result of charging and discharging the output capacitance and some of the internodal capacitances. The dynamic power dissipation is low in the Gray encoded FIFO because few bits toggle when the read and write pointers increment, as shown in Fig. 12. The total power dissipation of the Gray encoded FIFO is low, as shown in Fig. 13, and the total power dissipation of the complete NI is shown in Fig. 14.
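Since the static power in Table 1 is reported in µW and the dynamic power in mW, a unit conversion is needed before the two are summed. The short Python sketch below shows the conversion for a few Table 1 rows; the helper name `total_power_mw` is hypothetical.

```python
def total_power_mw(static_uw, dynamic_mw):
    """Total power in mW from static power in microwatts and dynamic power in mW."""
    return dynamic_mw + static_uw / 1000.0


# (FIFO type, depth, static uW, dynamic mW, total mW as listed in Table 1)
TABLE_1_ROWS = [
    ("Gray", 4, 4.327, 4.9705, 4.974827),
    ("Binary", 8, 8.296, 7.305, 7.313296),
    ("Johnson", 256, 277.301, 223.9204, 224.1977),
]

for fifo_type, depth, static_uw, dynamic_mw, listed_total in TABLE_1_ROWS:
    computed = total_power_mw(static_uw, dynamic_mw)
    print(f"{fifo_type} depth {depth}: {computed:.4f} mW (listed {listed_total})")
```

For these rows the static contribution is roughly 0.1% of the total, so the total power is dominated by the dynamic term.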

7.2.3. Area

The area of the asynchronous FIFO and of the complete NI increases proportionally with the depth of the FIFO, as shown in Figs. 15 and 16. An increase in FIFO size enlarges the memory unit and the number of bits in the read and write pointers of the NI, which increases the area of the entire module.

7.2.4. Latency

Latency is defined as the number of clock edges that occur after a read or write operation before the signal is updated. The latency from the write clock domain to the read clock domain differs from that in the opposite direction because of the asynchronous nature of the clocks. The total latency of the NI is the sum of the latency caused by the double synchronizer flip-flops on the read and write pointers of the FIFO, the latency of the packing/unpacking units and the latency of the WISHBONE master/slave wrappers in the proposed design. Each flip-flop in the synchronizer of the asynchronous FIFO adds a latency of one clock cycle. In the proposed NI design, the asynchronous FIFO is implemented so that the number of flip-flops in the synchronizer is configurable as per the design requirement. The packing and unpacking modules and the bus wrappers use the common reset signal and sample the data at the same rising edge of their respective clock domains with respect to the full and empty signals of the asynchronous FIFOs. The bus wrapper is designed to be latency free and is directly coupled with the asynchronous FIFO and the packing/unpacking modules. The total latency of the NI is therefore limited by the asynchronous FIFO and the packing/unpacking modules. In burst transfer mode the latency is negligible, because it is shared by a large number of flits, compared with flit-by-flit transfer. A cycle-accurate simulation of the entire design has been carried out using ModelSim 10.0b to find the exact value of the latency in nanoseconds.
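The relation between synchronizer length and latency can be illustrated with a shift-register model. The Python sketch below is a behavioural stand-in in which each stage delays the input by one edge of the destination clock; the class name `Synchronizer`, the parameter `stages` and the helper `edges_until_visible` are assumptions, and metastability is not modelled.

```python
class Synchronizer:
    """Chain of flip-flops clocked in the destination domain."""

    def __init__(self, stages=2):
        if stages < 1:
            raise ValueError("at least one flip-flop is required")
        self.regs = [0] * stages

    def clock(self, d):
        """Apply one rising edge and return the synchronized output."""
        self.regs = [d] + self.regs[:-1]
        return self.regs[-1]


def edges_until_visible(stages, value=1):
    """Count destination clock edges until a new non-zero value reaches the output."""
    sync = Synchronizer(stages)
    edges = 0
    while True:
        edges += 1
        if sync.clock(value) == value:
            return edges
```

In this model a one-stage synchronizer exposes the value after one edge and a two-stage synchronizer after two, matching the one-cycle-per-flip-flop rule stated above.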

The calculation of latency/throughput on the receiver side (IP to router) is as follows:

\[
\begin{aligned}
\text{Total latency} &= \text{WISHBONE bus wrapper latency at the IP side} + \text{FIFO latency} \\
&\quad + \text{Unpacking latency} + \text{WISHBONE bus wrapper latency at the router side} \\
&= 0\,\text{ns} + 3.04\,\text{ns} + 2.43\,\text{ns} + 0\,\text{ns} \\
&= 5.43\,\text{ns} \approx 2\ \text{clock cycles (when only one flip-flop is used for synchronization)}
\end{aligned}
\]


The calculation of latency/throughput on the transmitter side (router to IP) is as follows:

\[
\begin{aligned}
\text{Total latency} &= \text{bus wrapper latency at the router side} + \text{FIFO latency} \\
&\quad + \text{Unpacking latency} + \text{bus wrapper latency at the IP side} \\
&= 0\,\text{ns} + 4.02\,\text{ns} + 0 + 2.32 + 0 \\
&= 6.34\,\text{ns} \approx 2\ \text{clock cycles (when only one flip-flop is used for synchronization)}
\end{aligned}
\]

For asynchronous clock domains it is more meaningful to state the latency in nanoseconds than in number of clock cycles.

7.2.5. Throughput

The NoC data transmission occurs between the router and the IP core. The throughput from the NoC router to the IP core differs from that from the IP core to the router because of the packing and unpacking units. The proposed low latency NI improves the throughput and the overall performance. Table 4 shows the throughput of the NI in the router to IP and IP to router directions. Throughput is defined as the total number of flits processed by the NI per second.

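\[
\text{Throughput} = \frac{1}{\text{latency} \times (\text{flit}/\text{clock})}
\]

The latency is two clock cycles in the proposed design. For example, when the FIFO depth of the NI is 8, the corresponding frequency is 2000 MHz and the flit size is 32 bits:

\[
\text{Throughput} = \frac{1}{2 \times \left(1/(2000 \times 10^{6})\right)} = 10^{9}\ \text{flits/s} = 1000\ \text{Mflits/s} = 32{,}000\ \text{Mbits/s}.
\]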
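The throughput expression can be evaluated for other operating points as well. The Python sketch below applies it with one flit per clock period, as in the worked example; the function names are hypothetical, and the 412 MHz and 430 MHz inputs are the FPGA clock speeds listed for this work in Table 4.

```python
def throughput_flits_per_s(clock_hz, latency_cycles):
    """Flits per second for a given clock frequency and latency in cycles."""
    clock_period_s = 1.0 / clock_hz          # time for one flit per clock
    return 1.0 / (latency_cycles * clock_period_s)


def throughput_bits_per_s(clock_hz, latency_cycles, flit_bits):
    """Bits per second, scaling the flit rate by the flit width."""
    return throughput_flits_per_s(clock_hz, latency_cycles) * flit_bits


asic_mflits = throughput_flits_per_s(2000e6, 2) / 1e6
fpga_router_mflits = throughput_flits_per_s(412e6, 2) / 1e6
fpga_ip_mflits = throughput_flits_per_s(430e6, 2) / 1e6
print(round(asic_mflits), round(fpga_router_mflits), round(fpga_ip_mflits))
```

With a latency of two cycles the throughput is simply half the clock frequency, which gives 1000 Mflits/s at 2000 MHz and 206 and 215 Mflits/s at 412 and 430 MHz.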

7.3. Individual module performance of proposed NI

The performance of the individual modules is shown in Table 3. The WISHBONE master/slave wrappers offer higher speed than the other modules because of their simplicity. The packing and unpacking modules are slower and consume more registers and LUTs than the asynchronous FIFOs and the WISHBONE wrappers. Note that the overall register and LUT counts of the system are not equal to the sum of the counts of the individual modules when the modules are synthesized separately, as shown in Table 3.

7.4. Performance comparison analysis with existing NI

The proposed NI design is compared with the existing NI designs in ASIC [10,19] and in FPGA [14,16–18] as the target technology/device, as shown in Table 4. Avnet’s Xilinx Virtex-5 LX Evaluation Kit with device XC5VLX50-1FF676 is used for the NI implementation, and the Virtex-5 XC5VLX30 is also selected as a target device during post-synthesis so that the design can be compared with existing designs on the same device without using a figure of merit. A clear comparison between previous works and the proposed work is given in Table 4. S. no. 3 and S. no. 8 state our results for the ASIC and FPGA design processes respectively. With the STM 90 nm device technology, our design achieves a throughput of 1000 Mflits/s at a frequency of 2 GHz, which is an improvement of 4.5 times compared with the DMA based NI. An exact comparison with existing implementations is very difficult because the existing works were implemented in different ASIC technologies and targeted different FPGA devices. Although the proposed design is targeted to STMicroelectronics 90 nm CMOS technology, as shown in S. no. 3 in Table 4, the entire design is also synthesized on the same FPGA target device as the existing work, as shown in S. no. 8 in Table 4. This avoids complex calculations that use technology scaling to convert performance metrics from ASIC to FPGA or vice versa and keeps the comparison on an equivalent basis. Compared with the NI architecture for NePA [10], the proposed design is better by 170% in speed and by 10% in area. The proposed design offers a latency of 2 cycles instead of 4 and 5 cycles, and its throughput is 1000 Mflits/s, compared with 179 Mflits/s at a latency of 4 cycles and 143 Mflits/s at a latency of 5 cycles for the best-performing existing ASIC design, as shown in Table 4.

Table 3
Individual sub-module’s area and speed of proposed NI design targeted to the Xilinx FPGA.

Submodules | Number of slice registers | Number of slice LUTs | Number of fully used LUT-FF pairs | Speed in MHz
Unpacking unit | 57 | 55 | 53 | 546
WISHBONE master slave wrapper at IP end | 0 | 46 | 0 | 912
WISHBONE master slave wrapper at router end | 0 | 46 | 0 | 912
RX asynchronous FIFO | 16 | 44 | 12 | Wr-622/Rd-648
TX asynchronous FIFO | 16 | 44 | 12 | Wr-622/Rd-648
Packing unit | 81 | 66 | 109 | 492
Overall area | 310 | 303 | 218 | IP-430/RTR-412

Table 4
Comparison with existing works.

S. no. | Ref. no. | Bus/module | Modes/method | Target device/technology | Speed (MHz) | ASIC area (µm²) | FPGA area: number of slice registers (NSR) | FPGA area: number of slice LUTs (NSLUTS) | Power (mW) | Latency (cycles) | Throughput (MFlits/s)
1 | [10] | DMA | Buffered/unbuffered | TSMC 90 nm | 719 | 35,830 | – | – | 17.34 | 4/5 | 179/143
2 | [20] | NS | Asynchronous FIFO | 180 nm | 490 | 55,880 | – | – | 30.5 | NA | –
3 | This work | WB | Asynchronous FIFO | STM 90 nm | 2000 | 32,526 | – | – | 32 | 2 | 1000
4 | [15] | AHB | MNI-4p/2p/cb | xc5vlx50 | 310 | – | 6232/6242/5201 | 611/586/579 | 3114/5578/9694 | 3/3/3 | 103/103/103
4 | [15] | AHB | SNI-4p/2p/cb | xc5vlx50 | 262 | – | 7792/7782/7622 | 906/890/846 | 3792/6624/1068 | 6/4/3 | 43/65/87
5 | [17] | OCP | MNA-hs/cb | xc5vlx30 | 463/331 | – | 473/590 | 338/391 | 26/30 | 3/3 | 154/110
5 | [17] | OCP | SNA-hs/cb | xc5vlx30 | 354/370 | – | 772/649 | 694/531 | 28/30 | 10/4 | 118/123
6 | [18] | OCP | MNI-hs/cb | xc5vlx30 | 462/330 | – | 601/590 | 356/590 | 60/118 | 3/3 | 154/110
6 | [18] | OCP | Stoppable MNI-hs/cb | xc5vlx30 | 361/260 | – | 743/602 | 666/393 | 22/74 | 3/3 | 120/86
7 | [19] | OCP | MNA-hs/cb | xc5vlx30 | 378/246 | – | 1031/1116 | 772/811 | 64/40 | 3 | 126/82
7 | [19] | OCP | SNA-hs/cb (PB, IB and SRMD) | xc5vlx30 | 309/320 | – | 1579/1293 | 2057/1723 | 20/24 | 3 | 103/106
8 | This work | WB | Asynchronous FIFO | xc5vlx30 | IP-430 | – | 310 | 301 | 50 | 2 | 215
8 | This work | WB | Asynchronous FIFO | xc5vlx50 | RTR-412 | – | 310 | 301 | 50 | 2 | 206

NS – Non-Specific.
WB – WISHBONE.
AHB – AMBA 2.0 AHB.
4p – 4 phase.
2p – 2 phase.
cb – credit based.
PB – Precise Burst.
IB – Imprecise Burst.
SRMD – Single Request Multiple Data burst.
NA – Not Available.
MNA – Master Network Adapter.
SNA – Slave Network Adapter.
MNI – Master Network Interface.
SNI – Slave Network Interface.



In the FPGA based NI architectures, the authors used standard buses to transfer data from the router to the IP core using an MNA or MNI, and from the IP core to the router using an SNA or SNI, employing handshake and credit based flow control with or without power saving modes. In a practical NoC fabric design, the routers operate at the single frequency of the router clock domain, and the IP core, or a set of IP cores connected by a standard bus, operates at the single frequency of the IP clock domain. The worst case frequency is the lower of the router side frequency and the IP core side frequency. The maximum operating frequency of the router (RTR) clock domain of the proposed design is 412 MHz and the IP clock domain frequency is 430 MHz. To calculate the throughput, designers need to use the lower of the two frequencies (i.e. 412 MHz). Compared with the FPGA targeted NI architectures [14,16–18], the proposed design is better by 16.38% in speed, by 52.58% in number of slice registers and by 12.29% in number of slice LUTs. The proposed design offers a latency of 2 cycles instead of 3 cycles, and the throughput is improved by 33.76% relative to the best-performing design [16] among the existing NI architectures, as shown in Table 4. Compared with all existing designs, the proposed design offers a lower latency of 2 cycles, which is the main reason for the improvement in throughput. For all comparisons the FIFO depth is kept at 8. The comparison results show that the proposed architecture outperforms all the other architectures in terms of speed, area, latency and throughput.
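The percentages quoted above can be reproduced from Table 4 figures. The Python sketch below is a reconstruction under stated assumptions: the baseline values (354 MHz, 154 MFlits/s, 473 slice registers and 338 slice LUTs) are Table 4 entries chosen because they reproduce the quoted percentages to within rounding, and the text does not state explicitly which entries were used. The two helper names are hypothetical.

```python
def percent_gain(proposed, baseline):
    """Increase of the proposed value over the baseline, in percent of the baseline."""
    return 100.0 * (proposed - baseline) / baseline


def percent_saving(proposed, baseline):
    """Excess of the baseline over the proposed value, in percent of the proposed value."""
    return 100.0 * (baseline - proposed) / proposed


speed_gain = percent_gain(412, 354)           # MHz, router clock of this work
throughput_gain = percent_gain(206, 154)      # MFlits/s
register_saving = percent_saving(310, 473)    # slice registers
lut_saving = percent_saving(301, 338)         # slice LUTs

for label, value in [
    ("speed", speed_gain),
    ("throughput", throughput_gain),
    ("slice registers", register_saving),
    ("slice LUTs", lut_saving),
]:
    print(f"{label}: {value:.2f}%")
```

Under these assumptions the speed and throughput gains are expressed relative to the existing design, while the register and LUT savings are expressed relative to the proposed design.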

8. Conclusion

In order to speed up data transfer in the NI for NoC, a generic asynchronous FIFO-based WISHBONE compatible plug and play NI for NoC is presented in this paper. The existing AMBA, OCP and DMA-based NIs, which use many types of handshake and credit based flow control, have high latency, low throughput and low speed. The proposed NI offers lower latency than the existing designs because of its latency free wrappers, its merged micro-level architecture with packing/unpacking modules of one clock cycle latency, and its asynchronous FIFO with an optimum latency of one clock cycle. The proposed NI performs well irrespective of the router and processing core frequencies and phase differences. This NI allows easy integration of existing WISHBONE compatible IPs and other IPs with minimal manual effort in a shorter design cycle. The proposed design has been implemented in an STMicroelectronics 90 nm CMOS standard cell library, and the entire design is verified in a BFM based constrained random verification environment using Verilog-HDL. Experimental results show that the proposed NI offers a low latency of 2 clock cycles, 4.5 times higher throughput than the best available ASIC implementation and 33.76% higher throughput than the best available FPGA implementation. The speed of the proposed NI is increased by 170% in the ASIC design and by 16% in the FPGA based design. The area is reduced by 10% in ASIC, and the numbers of slice registers and LUTs are reduced by 52% and 12% respectively in the FPGA based design.

Acknowledgements

This research was partially supported by the Canadian Bureau for International Education (CBIE) on behalf of Foreign Affairs and International Trade, Canada (DFAIT), under the Canadian Commonwealth Exchange Program-Asia Pacific (formerly GSEP), which is gratefully acknowledged.

References

[1] Dally W, Towles B. Principles and practices of interconnection networks. San Francisco (CA): Morgan Kaufmann Pub; 2004.
[2] Dally W, Towles B. Route packets not wires: on-chip interconnection networks. In: Annual design automation conference; 2001. p. 684–9.
[3] Benini L, De Micheli G. Networks on chips: a new SoC paradigm. IEEE Comput 2002;35:70–8.
[4] Singh Sanjay Pratap, Bhoj Shilpa, Balasubramanian Dheera, Nagda Tanvi, Bhatia Dinesh, Balsara Poras. Generic network interfaces for plug and play NoC based architecture. In: Lecture notes in computer science springer reconfigurable computing: architectures and applications; 2006. p. 287–98.
[5] OpenCores WISHBONE Specification. WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores. OpenCores. rev. B.4; 2010.
[6] Cummings C, Alfke P. Simulation and synthesis techniques for asynchronous FIFO design with asynchronous pointer comparison. In: SNUG; 2002.
[7] Rahmani AM, Liljeberg O, Plosila J, Tenhunen H. An efficient VFI-based NoC architecture using Johnson-encoded reconfigurable FIFOs. In: IEEE international norchip conference; 2010. p. 1–5.
[8] Fattah M, Manian A, Rahimi A, Mohammadi S. A high throughput low power FIFO used for GALS NoC buffers. In: IEEE annual symposium on VLSI; 2010. p. 333–8.
[9] Bergeron J. Writing testbenches: functional verification of HDL models. Norwell (MA): Kluwer Academic Publishers; 2003.
[10] Lee Seung Eun, Bahn Jun Ho, Yang Yoon Seok, Bagherzadeh Nader. A generic network interface architecture for a networked processor array (NePA). In: ARCS; 2008. p. 247–60.
[11] Lai Yong-Long, Yang Shyue-Wen, Sheu Ming-Hwa, Hwang Yin-Tsung, Tang Hui-Yu, Huang Pin-Zhang. A high-speed network interface design for packet-based NoC. In: IEEE ICCCAS; 2006. p. 2667–71.
[12] Beigne E, Vivet P. Design of on-chip and off-chip interfaces for a GALS NoC architecture. In: 12th IEEE international symposium on asynchronous circuits and systems; 2006. p. 172–83.
[13] Thonnart Yvain, Beigné Edith, Vivet Pascal. Design and implementation of a GALS adapter for a NoC based architectures. In: ASYNC; 2009. p. 13–22.
[14] Attia Brahim, Chouchene Wissem, Zitouni A, Nourdin A, Tourki R. Design and implementation of low latency network interface for network on chip. In: International conference on design and test workshop; 2010. p. 37–42.
[15] Ebrahimi Masoumeh, Daneshtalab Masoud, Sreejesh NP, Liljeberg Pasi, Tenhunen Hannu. Efficient network interface architecture for network-on-chips. In: NORCHIP; 2009. p. 1–4.
[16] Attia B, Zitouni A, Tourki R. Design and implementation of network interface compatible OCP for packet based NOC. In: International conference on design & technology of integrated systems in nanoscale era; 2010. p. 1–8.
[17] Chouchene W, Attia B, Zitouni A, Abid N, Tourki R. A low power network interface for network on chip. In: International multi-conference on systems, signals and devices; 2011. p. 1–6.

[18] Attia B, Chouchene Wissem, Zitouni Abdelkrim, Tourki Rached. Network interface sharing for SoCs based NoC. In: International conference on communications, computing and control applications; 2011. p. 1–6.

[19] Matos D, Costa M, Carro L, Susin A, Matos D, et al. Network interface to synchronize multiple packets on NoC-based systems-on-chip. In: VLSI system on chip conference; 2010. p. 31–6.

[20] Atienza D, Angiolini F, Murali S, Pullini A, Benini L, De Micheli G. Network-on-chip design and synthesis outlook. Integration of VLSI J 2008;41(3):340–59.

[21] Apperson RW, Yu Zhiyi, Meeuwsen MJ, Mohsenin T, Baas BM. A scalable dual-clock FIFO for data transfers between arbitrary and haltable clock domains. IEEE Trans Very Large Scale Int (VLSI) Syst 2007;15(10):1125–34.

[22] Dally W, Poulton J. Digital systems engineering. Cambridge, UK: Cambridge Univ Press; 1998.

[23] Xilinx verification document. BFM Simulation in Platform Studio. Xilinx Inc; 2004.

[24] Gregg D, Tim L. Designing procedural-based behavioral bus functional models for high performance verification. In: SNUG; 1999.

[25] Altera design document. SCFIFO and DCFIFO mega functions. Altera Inc

K. Swaminathan received his B.E. degree in Electrical and Electronics Engineering from IRTT, Erode, Bharathiar University, India in 2001. He received the M.E. degree (VLSI Design) from Govt. College of Technology, Coimbatore, India in 2005. He is currently pursuing his Ph.D. degree at NIT, Tiruchirappalli. His research interests include System on Chip, Network on Chip and Design/Verification of Complex Digital Systems.

G. Lakshminarayanan received the M.E. and Ph.D. degrees in Electronics and Communication Engineering from Bharathidasan University, Tiruchirappalli, India, in 1995 and 2005, respectively. He is currently working as an Associate Professor in the Department of ECE, NIT, Tiruchirappalli. His current research interests include Reconfigurable Systems, VLSI based Wireless System Design, Algorithms and Techniques for Cognitive Radio and Network on Chip.

Seok-Bum Ko received his Ph.D. in Electrical & Computer Engineering at the URI, USA in 2002. He is currently an Associate Professor in the Department of Electrical & Computer Engineering at the University of Saskatchewan, Canada. His research interests include efficient hardware implementation of computer system, computer arithmetic, digital design automation and computer architecture. He is a senior member of IEEE Computer Society.





