
Transcript of [IEEE 2010 VI Southern Programmable Logic Conference (SPL) - Ipojuca, Pernambuco, Brazil...

A FULL DUPLEX IMPLEMENTATION OF INTERNET PROTOCOL VERSION 4 IN AN FPGA DEVICE

Paulo César C. de Aguirre, Lucas Teixeira, Crístian Müller, Fernando Luís Herrmann, Leandro Z. Pieper, Josué de Freitas, Gustavo Dessbesell, João Baptista Martins

Electrical Engineering Course – Microelectronics Group
Federal University of Santa Maria, Santa Maria, Brazil
email: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT

This paper describes a hardware implementation of Internet Protocol version 4 (IPv4). Routing and addressing features were integrated with network interfaces and synthesized to a Stratix II FPGA device. This work presents two implementations of a full duplex IPv4 core: the first is a Reference design, and the second uses the same design but with more buffer space. We present the advantages and disadvantages of each implementation and compare them in terms of throughput, frame loss rate and power dissipation. The implementation with more buffer space performs better with respect to frame loss rate, but dissipates more power than the Reference design. Both implementations presented similar results in the throughput tests.

1. INTRODUCTION

The development of a network protocol in hardware brings numerous benefits to the performance of any network. Latency, the waiting time in the communication between two computers (hosts), is one of the main issues of the Internet [1]. A decrease in latency and an increase in data throughput in switches and routers can be achieved by implementing the Internet Protocol (IP) in hardware.

Companies like Intel and CISCO have devoted great effort to communication networks, and currently market communication protocols implemented as ASICs. Intel has developed applications that join Gigabit Ethernet and the PHY (Physical Layer) in a single integrated circuit [2], as well as other solutions based on network processors [3]. CISCO, in turn, offers the G8000 Packet Processor Family, which supports the Gigabit Ethernet and 10 Gigabit Ethernet standards and works with both IPv4 and IPv6 (Internet Protocol version 6).

In this context, the goal of this work is to explore the design space of an IPv4 hardware module, taking the MAC buffer size as the design variable. First, a Reference design has been developed; it works as a gateway between networks, performing routing and addressing of data packets. Then, a second design with increased buffer area (the Buffer Increased design) has been derived from the first, and the performance of both has been compared.

The next section presents the reference IPv4 hardware core and its main features. The full system built in the FPGA device and the modifications to the reference IP-core are described in Section 3. The tests used to verify and quantify the performance of the designs are described in Section 4, and their results are discussed in Section 5. Finally, conclusions are drawn in Section 6.

2. IPV4 DEVELOPMENT

The IP protocol is responsible for sending and receiving data packets through the Internet and is described in RFC 791 [4]. The protocol is not entirely reliable, but it is a fast mechanism for data transfer. Version 4 of the protocol was chosen because it is the most widely used and yields a more area-effective design than the more recent IPv6 protocol.

Fig. 1. Block diagram of the complete system

978-1-4244-6311-4/10/$26.00 ©2010 IEEE

The developed IP-core, whose architecture is shown in the shaded area of Fig. 1, is divided into three main blocks, each responsible for a single task. The Receiver and Sender blocks evaluate and forward, with the necessary modifications, the received datagram.

ARP (Address Resolution Protocol) is intrinsic to the addressing function; it is responsible for locating and storing the MAC (Media Access Control) address of each host in the network. A partial implementation that performs address resolution of known hosts was used. A static ARP table was adopted due to schedule constraints, since a dynamic one would require more effort (and thus time) while having no impact, positive or negative, on the measurements performed here.
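The static ARP table described above can be sketched in software as a fixed address map. The Python sketch below is purely illustrative: the table entries and the `resolve_mac` helper are our own assumptions, not part of the actual SystemVerilog design.

```python
from typing import Optional

# Illustrative static ARP table: a fixed mapping from IPv4 addresses
# to MAC addresses of known hosts (the entries are made up).
ARP_TABLE = {
    "192.168.0.1": "00:1a:2b:3c:4d:5e",
    "192.168.1.1": "00:1a:2b:3c:4d:5f",
}

def resolve_mac(ip_addr: str) -> Optional[str]:
    """Return the MAC address of a known host, or None if unresolved."""
    return ARP_TABLE.get(ip_addr)
```

A dynamic ARP implementation would additionally learn entries from incoming ARP replies; as noted above, this was not required for the measurements performed here.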

3. FPGA PROTOTYPING

The developed IP-core was implemented in a Stratix II EP2S60-F672C3N FPGA device and some tests were performed. The main functions performed in this implementation are routing and addressing.

3.1. Reference Design

The communication stack can be described as a five-layer engine, following the five conceptual layers of [5]: Application (upper), Transport, Internet, Network Interface and Hardware (lower). The prototyping of the design on an FPGA development board covers the three lower layers of this conceptual stack: Internet, Network Interface and Hardware.

The IP-core is responsible for the Internet layer. Any datagram arriving at this layer, if forwarded, must have its data field kept intact: IP may modify only its own header values during the routing process.
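The header-only modification performed during routing can be illustrated in software. The sketch below decrements the TTL and recomputes the RFC 791 one's-complement header checksum over a standard 20-byte IPv4 header; it is a behavioral illustration with our own function names, not the hardware implementation.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words, as defined in RFC 791."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total > 0xFFFF:                    # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def forward(header: bytearray) -> bytearray:
    """Decrement the TTL (byte 8) and recompute the checksum (bytes 10-11),
    leaving every other header field, and the data field, untouched."""
    header[8] -= 1                           # TTL
    header[10:12] = b"\x00\x00"              # zero checksum before recomputing
    header[10:12] = struct.pack("!H", ipv4_checksum(bytes(header)))
    return header
```

Verifying the checksum of a forwarded header (summing all 16-bit words including the checksum field) yields zero, which is how a receiver detects a corrupted header.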

The Internet layer was the one designed in this work, while the other two were obtained from Altera and from a commercial ASIC.

The Network Interface layer is Altera's Triple Speed Ethernet MegaCore version 9.0 [6]; it was generated using the 10/100/1000 Ethernet MAC core variation with MII/GMII interfaces, including internal FIFOs.

The Hardware layer consists of an ASIC (a high-performance triple-speed Marvell 88E1111 10/100/1000 Ethernet PHY [7]), available on an expansion board connected to the development board that contains the FPGA device. Two daughter boards were used in this implementation; each contains one network interface and a PHY.

The HDL (Hardware Description Language) used to code the IP-core was SystemVerilog. After all functional requirements of the design had been verified, a few more blocks were coded to allow prototyping on a development board containing an FPGA device and network interfaces.

The full system built to test the Internet layer in a real application is shown in Fig. 1. Drivers were coded to adapt the MAC interfaces, which use the Avalon Streaming protocol [8], to the IPv4 interfaces, which use the AMBA AXI protocol [9]. The AMBA AXI protocol is used by the IPv4 block because it is simple to implement and allows high-frequency operation. Each input driver contains a buffer responsible for storing the frame incoming from the MAC. These buffers can store only one frame, so the received frame is stored and only the Internet Protocol datagram is sent to the IPv4 core; the MAC header is discarded by the driver. Analogously, the output drivers have buffers with the same storage capacity and build the frame header that is sent to the network layer.
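The behavior of an input driver can be modeled as follows. This is a simplified software sketch under our own naming: the single-frame buffer and the discarding of the 14-byte Ethernet header follow the description above, while the Avalon-ST/AXI protocol adaptation is omitted.

```python
ETH_HEADER_LEN = 14  # dst MAC (6) + src MAC (6) + EtherType (2)

class InputDriver:
    """Behavioral model of an input driver: a buffer holding a single
    frame, from which only the IP datagram is forwarded (the MAC
    header is discarded)."""

    def __init__(self):
        self.buffer = None

    def receive(self, frame: bytes) -> bool:
        """Store a frame; reject it if the single-frame buffer is full."""
        if self.buffer is not None:
            return False  # only one frame fits: a second arrival is lost
        self.buffer = frame
        return True

    def to_ip_core(self) -> bytes:
        """Strip the Ethernet header and release the buffer."""
        datagram = self.buffer[ETH_HEADER_LEN:]
        self.buffer = None
        return datagram
```

The `receive` returning `False` models the frame-loss mechanism that the larger FIFOs of Section 3.2 are meant to mitigate.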

A development kit named Nios II Development Board Stratix II Edition, containing a Stratix II EP2S60-F672C3N FPGA device, was used. Both logic synthesis and power dissipation estimation were performed using the Altera Quartus II 9.0 tool. Synthesis results are shown in Table 2, while power dissipation estimates are depicted in Table 3.

3.2. Buffer Increased design

We proposed a buffer size increase in the Network Interface layer: the receiver and sender FIFOs were set to 64 KB and 32 KB, respectively. This modification aims to store more frames and thereby decrease frame loss.
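As a back-of-envelope check of what the enlarged FIFOs can hold, the sketch below counts whole frames per buffer, using the RFC 2544 frame sizes listed in Section 4 (this arithmetic is our own illustration, not from the paper).

```python
def frames_stored(buffer_bytes: int, frame_bytes: int) -> int:
    """Number of whole frames that fit in a FIFO of the given size."""
    return buffer_bytes // frame_bytes

RX_FIFO = 64 * 1024  # 64 KB receiver FIFO (Buffer Increased design)
TX_FIFO = 32 * 1024  # 32 KB sender FIFO

# The 64 KB receiver FIFO holds about 43 maximum-size (1518-byte)
# frames, or 1024 minimum-size (64-byte) frames.
```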

Table 3. Power dissipation estimates for both designs

Block type         | Total power (mW)             | Routing dynamic power (mW)
                   | Reference | Buffer Increased | Reference | Buffer Increased
PLL                |  13.41    |  13.41           |   0.00    |   0.00
I/O                |  38.43    |  38.36           |   0.52    |   0.55
Dedicated memory   |  62.30    | 432.74           |   4.37    |  18.94
Combinational cell |  55.68    |  65.17           |  10.00    |  10.02
Register cell      | 127.23    | 193.91           |  65.55    | 122.36


4. PERFORMANCE MEASUREMENTS

In order to compare the performances of both implementations, some tests were performed. Such tests were based on RFC 2544 [10], which discusses and defines a number of tests that may be used to describe the performance characteristics of a network interconnecting device.

4.1. Test Environment

The tester was implemented on a development board containing a Xilinx Virtex 4 XC4VSX35-10FF668 FPGA device. It was in charge of sending bursts of frames to the Device Under Test (DUT) and then receiving the DUT's answers. Tests were performed using frames of 64, 128, 256, 512, 1024, 1280 and 1518 bytes, the standard sizes for Ethernet testing proposed in that RFC.

Frames were injected in bursts of five seconds with pauses of three seconds between each burst. This test was repeated five times and the average values were computed.

Only one Gigabit Ethernet interface was used during the test sessions, since the same path is traversed by any datagram handled by the IPv4 hardware core, regardless of the source and destination network interfaces. Injected frames carried UDP datagrams in the IP data field, so they were sent to the IP-core and routed by it.

4.2. Throughput Test

The throughput test determines the maximum frame rate at which the DUT loses no frames. A pre-determined number of frames was sent at the full (100%) frame rate. At full frame rate, the maximum throughput of a Gigabit Ethernet interface is reached: between two frames there is only the minimum Inter Frame Gap (IFG) [11] of 96 ns (12 clock cycles of 8 ns).
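The 96 ns figure follows from line-rate arithmetic: at 1 Gb/s each byte takes 8 ns on the wire, and each frame is preceded by the 8-byte preamble (an IEEE 802.3 detail not mentioned above) as well as followed by the 12-byte IFG. A small sketch of the resulting maximum frame rate at 100% line rate:

```python
NS_PER_BYTE = 8  # 1 Gb/s -> 1 bit/ns -> 8 ns per byte
PREAMBLE = 8     # preamble + start-of-frame delimiter, bytes (IEEE 802.3)
IFG = 12         # minimum Inter Frame Gap, bytes (96 ns at 1 Gb/s)

def max_frame_rate(frame_bytes: int) -> float:
    """Frames per second at 100% line rate on Gigabit Ethernet."""
    wire_time_ns = (frame_bytes + PREAMBLE + IFG) * NS_PER_BYTE
    return 1e9 / wire_time_ns
```

For the frame sizes used here this ranges from roughly 1.49 million frames/s (64-byte frames) down to about 81 thousand frames/s (1518-byte frames).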

The throughput test was carried out by sending a determined number of frames at a specific rate to the DUT and then counting the frames successfully transmitted back by it. In case of frame loss (fewer frames received back from the DUT than sent to it), the stream rate was decreased and the test repeated.

4.3. Frame Loss Rate Test

In a test similar to the one previously described, the frame loss rate was evaluated. The test begins at the 100% frame rate by sending a pre-determined number of frames and reporting the percentage of lost frames. The frame rate was then reduced by 5% at each step, and the percentage of lost frames was reported for each rate.
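RFC 2544 defines the frame loss rate as ((input_count - output_count) * 100) / input_count. The stepped-rate procedure above can be sketched as follows; `run_trial` is a hypothetical stand-in for the hardware tester, not part of the actual test setup.

```python
def frame_loss_rate(sent: int, received: int) -> float:
    """Frame loss rate in percent, as defined in RFC 2544."""
    return (sent - received) * 100.0 / sent

def loss_sweep(run_trial, start=100, step=5, stop=5):
    """Start at 100% frame rate and reduce by 5% per step, reporting
    the loss rate at each rate.  run_trial(rate) must return
    (frames_sent, frames_received) for one burst at that rate."""
    results = {}
    rate = start
    while rate >= stop:
        sent, received = run_trial(rate)
        results[rate] = frame_loss_rate(sent, received)
        rate -= step
    return results
```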

5. RESULTS AND EVALUATION

Starting the comparison between the two implementations from a utilization and power consumption point of view, the Buffer Increased design, as expected, has increased requirements in both aspects when compared with the original design. As shown in Table 2, although logic requirements are almost identical for both designs, memory block usage increased from 3% to 77% of the device. Similarly, according to Table 3, the estimated power consumption increased from 297.05 mW to 743.59 mW. Although it presents drawbacks regarding power and area requirements, the Buffer Increased design performs considerably better when it comes to frame loss rate.

Looking at Fig. 3 and Fig. 4, one can notice that both designs present similar results up to a 40% input frame rate. From this point onwards, however, the frame loss rate is not only significantly smaller in the Buffer Increased design, but also follows a regular pattern. Considering input rates over 45%, the frame loss rate of the modified design is almost 14 times smaller in the best case (50% input rate and 1024-byte frames), 1.2 times smaller in the worst case (85% input rate and 64-byte frames) and around 3 times smaller on average.

Results concerning throughput are shown in Fig. 2. As mentioned in Section 4, they refer to cases where the frame is successfully handled by the DUT (not lost). One can notice that results are similar in most cases. In one particular case, however, the Reference design presented a throughput almost 2 times higher (for 512-byte frames).

Fig. 2. Maximum throughput reached with each frame size.

Table 2. Total logic blocks used in the FPGA device

Type                      | Reference design | Buffer Increased design
Combinational ALUTs       |  9% (4,313)      | 10% (4,874)
Dedicated logic registers |  9% (4,187)      | 11% (5,133)
Total block memory bits   |  3% (77,392)     | 77% (1,956,576)

6. CONCLUSIONS

This work presented implementation and performance results for a reference design and a modified design (with more buffer area) of a hardware IPv4 block featuring a full communication stack. The modified design consumes 2.5 times more power and around 26 times more area (mainly memory blocks) than the original one.

However, the frame loss rate shown by the modified design proved to be significantly lower in some circumstances, ranging from around 14 times lower in the best case to 1.2 times lower in the worst case, for input rates over 45%.

Throughput, on the other hand, is quite similar for both designs in most cases, but diverges (for better or for worse) in a few. It is believed that an issue in the memory management block is preventing the modified design from achieving better throughput results than the original one. This issue is under investigation, and better throughput performance is expected from the modified design in the near future.

REFERENCES

[1] M. S. Borella, A. Sears and J. A. Jacko, "The effects of Internet latency on user perception of information content," IEEE Global Telecommunications Conference, 1997.

[2] Intel, "82544EI Gigabit Ethernet Controller Datasheet." Available: http://download.intel.com/design/network/datashts/82544ei.pdf.

[3] Intel, "Intel Network Processors." Available: http://www.intel.com/design/network/products/npfamily/index.htm?iid=ncdcnav2+proc_netproc.

[4] Defense Advanced Research Projects Agency, "Internet Protocol," Request for Comments 791, Virginia, USA, 1981. Available: http://www.ietf.org/rfc/rfc0791.txt?number=791.

[5] D. E. Comer, "The TCP/IP 5-Layer Reference Model," in Internetworking with TCP/IP, 4th ed., vol. 1, pp. 183-185. ISBN 0-13-018380-6.

[6] Altera, "Triple Speed Ethernet MegaCore Function." Available: http://www.altera.com/products/ip/iup/ethernet/m-alt-ethernet-mac.html.

[7] "10/100/1000 PHY Daughter Board 88E1111." Available: http://www.morethanip.com/boards_10_100_1000_88E1111.htm.

[8] Altera, "Avalon Interface Specification." Available: http://www.altera.com/literature/manual/mnl_avalon_spec.pdf.

[9] ARM, "AMBA AXI Protocol Specification." Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ihi0022b/index.html.

[10] Network Working Group, "Benchmarking Methodology for Network Interconnect Devices," RFC 2544, Harvard, USA, 1999. Available: http://www.ietf.org/rfc/rfc2544.txt.

[11] IEEE, IEEE 802.3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications, 2008. Available: http://standards.ieee.org/getieee802/download/802.3-2008_section1.pdf.

Fig. 3. Frame Loss Rate Results for the Reference design.

Fig. 4. Frame Loss Rate Results for the Buffer Increased design.
