
Network Adapter Performance

We have used netpipe to test the performance of different Gigabit network cards. The Intel/MPIGAMMA combination has the lowest latency, outperforming proprietary RDMA cards. We use bidirectional messages rather than the standard one-way test, since this is more closely related to real-world applications, where nodes exchange data rather than just send a message from A to B. We compare different network cards, different software layers, and switches vs. crossover cables; details are given below. The one-way data rate is reported in MBytes/sec; the theoretical maximum for Gigabit Ethernet is 125. Note that the output file from netpipe reports the total message buffer size, twice the size of an individual message (reported below). A minimal sketch of the bidirectional exchange being timed follows the results table.

                      B'Com A   B'Com B   Intel A   Intel B   Intel C   Intel D   L5       A'sso*
Latency               59μs      20μs      63μs      32μs      8.8μs     12μs      20μs     30μs
Peak rate (MB/s)      99        90        107       113       116       110       102      74
Rate at 64KB (MB/s)   89        81        95        105       116       110       102      69
Rate at 16KB (MB/s)   74        84        66        82        99        82        80       55
Rate at 4KB (MB/s)    31        51        33        43        62        40        50       39
Rate at 1KB (MB/s)    13        22        12        22        45        31        26       16
Message size at max   256KB     4096KB    512KB     256KB     64KB      96KB      64KB     128KB
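
The bidirectional exchange described above can be sketched in a few lines of MPI C. This is not NetPIPE itself, just an illustration of the measurement under assumed parameters (a fixed 64KB message and 1000 repetitions): both ranks post a non-blocking receive and send of the same size simultaneously, and the one-way rate is the bytes sent in one direction divided by the elapsed time.

    /* Illustrative bidirectional bandwidth test (not NetPIPE): each of two
     * ranks sends and receives an n-byte message at the same time.
     * Build with an MPI C compiler (e.g. mpicc) and run on 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {
            if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        const int n = 64 * 1024;     /* message size (64KB here, arbitrary) */
        const int reps = 1000;       /* exchanges to average over (arbitrary) */
        char *sendbuf = malloc(n), *recvbuf = malloc(n);
        memset(sendbuf, rank, n);
        int peer = 1 - rank;
        MPI_Request req[2];

        /* One untimed warm-up exchange. */
        MPI_Irecv(recvbuf, n, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, n, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        /* Timed loop: both directions are active in every iteration. */
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            MPI_Irecv(recvbuf, n, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sendbuf, n, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
        double dt = MPI_Wtime() - t0;

        if (rank == 0)
            printf("one-way rate: %.1f MB/s for %d-byte messages\n",
                   (double)n * reps / dt / 1.0e6, n);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }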

Hardware:

1. Dell PE850: single P4D (3.0GHz dual-core) and Dell PE1750*: dual Xeon (2.8GHz)
2. OS: CentOS 4.2 (Linux 2.6.9 and Linux 2.6.12-gamma) and NPACI Rocks (Linux 2.4)*
3. Switches: Crossover cable, Extreme Networks Summit400-t48, Cisco chassis switch (model 6509)*

(Entries marked * were used only for the Ammasso tests; see observation 10.)

Network card          Driver                  MTU    MPI                 Switch
Broadcom 5721 (A)     tg3-3.43                1500   LAM-7.1             400-t48
Broadcom 5721 (B)     GAMMA-06-08-08          1500   MPIGAMMA-06-07-17   400-t48
Intel 82545GM (A)     e1000-6.0.54-k2-NAPI    4120   LAM-7.1             Crossover
Intel 82545GM (B)     e1000 (ITR=0, TID=128)  4120   LAM-7.1             Crossover
Intel 82545GM (C)     GAMMA-06-02-17          4120   MPIGAMMA-06-02-09   Crossover
Intel 82545GM (D)     GAMMA-06-02-17          4120   MPIGAMMA-06-02-09   400-t48
Level 5 EF1-21022T    Proprietary             N/A    Proprietary         Crossover
Ammasso 1100          Proprietary             N/A    Proprietary         Cisco 6509

Some observations:

1. The Broadcom 5721 (B'Com A) performs well when used with a recent tg3 driver (from Broadcom); the 3.10 driver in the 2.6.9 kernel is not as good. The Intel PRO 1000 (Intel A) has high bandwidth, but the large TCP driver latency reduces the performance for small messages.

2. The TCP performance of the Intel NIC (82545GM) can be improved by tuning the driver parameters. In case B we used InterruptThrottleRate=0 and TxIntDelay=128. The ITR=0 setting reduces the latency by a factor of 2; TID=128 allows for reduced CPU load with no measurable effect on performance. The effects of driver tuning may be more noticeable in practice than these limited results suggest. We notice that the throughput of the PRO 1000 with default TCP driver settings is quite erratic, and these dropouts may substantially reduce performance with large numbers of nodes.

3. Tuning the TCP buffer size brings increased performance. Generally, larger TCP buffers increase performance for large messages at the cost of a small performance penalty for smaller messages. We used 1 MByte buffers, which we found to be a good compromise (see the socket-buffer sketch after this list). Increasing the frame size (MTU) has a similar effect, with the added benefit of a reduction in CPU usage; we used MTU=4120 (4096+24) in these tests.

4. MPIGAMMA is a port of MPI to the GAMMA Ethernet driver; it reduces the latency of the default Intel adapter by an order of magnitude, better than the expensive proprietary interfaces from Level 5 and Ammasso. It is only compatible with the 2.6.12 and 2.6.18 kernels at present. The hardware latencies of the Broadcom and Intel NICs wired back to back, as measured by GAMMA, are 14μs and 6.5μs respectively. There is an additional latency of 2.3μs from the MPIGAMMA software layer and 3.3μs from the Summit400-t48 switch. The Broadcom NIC failed to reach its asymptotic bandwidth because we did not adjust the credit limit to account for the smaller packet size (1500 bytes); the Intel NICs used the optimum 4120-byte frame size.

5. With a low-latency network card the switch latency becomes critical, and the Summit400-t48 performs very well in this regard: the additional latency from the switch is only 3.3μs.

6. The switch reduces the maximum throughput under MPIGAMMA because the added latency requires a larger message size to saturate. GAMMA was configured to accept up to 32 4KB packets before sending an acknowledgement packet, and MPIGAMMA therefore takes a small (<10%) performance hit around 128KBytes; slightly better performance can be obtained at the cost of more memory allocated to GAMMA. In bidirectional mode MPI typically reaches a peak performance for messages of around 64KBytes, when it switches from "eager" to "rendezvous" protocols. LAM has a user interface (SSI) for choosing the transition message size; our results with LAM used the eager protocol throughout, which was found to deliver the maximum throughput.

7. The overall performance of a GAMMA-enabled Intel PRO NIC is remarkable; real-world applications can pass quite small bidirectional messages (32KBytes) at rates in excess of 100 MB/s each way.

8. The proprietary RDMA cards from Level 5 and Ammasso (retail price around $500) used to outperform TCP/IP-based NICs, but with newer TCP drivers the Intel NIC outperforms both proprietary cards, except the L5 at small packet sizes. The Intel+MPIGAMMA combination is faster than the proprietary cards.

9. The Ammasso and Level 5 NICs substantially reduce CPU usage, which can increase performance if computation and communications are overlapped (see the overlap sketch after this list); GAMMA requires 100% CPU utilization and cannot profit from overlapping. The TCP driver (e1000-6.0.54-k2-NAPI) uses about 30% of the CPU; the tuned driver uses 40-50% CPU, depending on TID and MTU.

10. The Ammasso cards were tested on older hardware (2.8GHz Xeons) with a 2.4 kernel; the other tests were with the 3GHz P4D and the 2.6 kernel.
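
To make the TCP buffer point in item 3 concrete, here is a minimal sketch of the generic, application-level way to request roughly 1 MByte socket buffers. The notes above do not say exactly how the 1 MByte buffers were configured (kernel sysctls or the MPI layer's own settings are equally possible), so this is an illustration of the idea rather than the procedure used in these tests.

    /* Illustration only: request ~1 MB TCP send/receive buffers on a socket.
     * The tests above used 1 MByte buffers, but the exact mechanism is not
     * stated; kernel sysctls (e.g. net.core.rmem_max) or MPI-level settings
     * are common alternatives to this per-socket approach. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        int bufsize = 1 << 20;   /* 1 MByte, the compromise quoted in item 3 */
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_RCVBUF");

        /* Read back the effective size; the kernel may clamp or double it. */
        int actual = 0;
        socklen_t len = sizeof(actual);
        getsockopt(s, SOL_SOCKET, SO_SNDBUF, &actual, &len);
        printf("effective send buffer: %d bytes\n", actual);

        close(s);
        return 0;
    }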
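
Item 9's point about overlapping computation and communication can also be sketched generically in MPI (buffer names and sizes below are illustrative, not taken from the tests): the exchange is posted with non-blocking calls, work that needs no remote data proceeds while the messages are potentially in flight, and the wait happens only when the incoming data is required. Whether real overlap occurs depends on the NIC and driver; cards that process the protocol on-board can progress the transfer while the CPU computes, whereas GAMMA, which needs the host CPU, cannot.

    /* Generic sketch of overlapping computation with communication in MPI
     * (illustrative buffer names and sizes, not code from the tests above).
     * Run on exactly 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);

        const int n = 4096;            /* boundary (halo) size, arbitrary */
        const int interior = 1 << 20;  /* interior points, arbitrary */
        double *halo_out = calloc(n, sizeof(double));
        double *halo_in  = calloc(n, sizeof(double));
        double *field    = calloc(interior, sizeof(double));
        int peer = 1 - rank;
        MPI_Request req[2];

        /* 1. Post the boundary exchange without blocking. */
        MPI_Irecv(halo_in,  n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(halo_out, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

        /* 2. Interior work that needs no remote data runs while the
         *    messages are (potentially) in flight. */
        for (int i = 0; i < interior; i++)
            field[i] = 0.5 * field[i] + 1.0;

        /* 3. Wait only when the incoming boundary data is actually needed. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        for (int i = 0; i < n; i++)
            field[i] += halo_in[i];

        if (rank == 0) printf("exchange and interior update complete\n");

        free(halo_out); free(halo_in); free(field);
        MPI_Finalize();
        return 0;
    }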

Sources and documentation

LAM v7.1: http://www.lam-mpi.org
MPI/GAMMA: http://www.disi.unige.it/project/gamma/mpigamma

Acknowledgements: We thank Giuseppe Ciaccio for extensive help in setting up our MPIGAMMA installation and also for eliminating a number of obscure bugs. We thank the University of Florida High-Performance Computer Center (http://www.hpc.ufl.edu) for access to the Ammasso and Cisco hardware.
