
Network Adapter Performance

We have used netpipe to test the performance of different Gigabit network cards. The Intel/MPIGAMMA combination has the lowest latency, outperforming proprietary RDMA cards. We use bidirectional messages rather than the standard one-way test, since this is more closely related to real-world applications, where nodes exchange data rather than just send a message from A to B. We compare different network cards, different software layers, and switches vs. crossover cables; details are given below. The one-way data rate is reported in MBytes/sec; the theoretical maximum for Gigabit Ethernet is 125. Note that the output file from netpipe reports the total message buffer size, twice the size of an individual message (reported below).
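For reference, a minimal sketch of this kind of bidirectional exchange test is given below (in C with MPI). It is not the netpipe source; the message size and repetition count are arbitrary choices for illustration, and it assumes exactly two MPI ranks.

    /*
     * Minimal bidirectional bandwidth sketch (illustration only, not netpipe).
     * Both ranks exchange NBYTES with MPI_Sendrecv, and the one-way rate per
     * direction is reported, as in the table below.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBYTES (64 * 1024)   /* message size per direction, e.g. 64 KBytes */
    #define NREPS  1000          /* number of timed exchanges */

    int main(int argc, char **argv)
    {
        int rank, peer, i;
        char *sendbuf = malloc(NBYTES);
        char *recvbuf = malloc(NBYTES);
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;                       /* run with exactly two ranks */

        t0 = MPI_Wtime();
        for (i = 0; i < NREPS; i++)
            MPI_Sendrecv(sendbuf, NBYTES, MPI_BYTE, peer, 0,
                         recvbuf, NBYTES, MPI_BYTE, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d bytes each way: %.1f MB/s per direction\n",
                   NBYTES, (double)NBYTES * NREPS / (t1 - t0) / 1.0e6);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched as a two-process job, this reproduces the shape of the measurement, not the numbers reported below.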

                      B'Com A  B'Com B  Intel A  Intel B  Intel C  Intel D  L5     A'sso*
Latency               59μs     20μs     63μs     32μs     8.8μs    12μs     20μs   30μs
Peak rate (MB/s)      99       90       107      113      116      110      102    74
Rate at 64KB (MB/s)   89       81       95       105      116      110      102    69
Rate at 16KB (MB/s)   74       84       66       82       99       82       80     55
Rate at 4KB (MB/s)    31       51       33       43       62       40       50     39
Rate at 1KB (MB/s)    13       22       12       22       45       31       26     16
Message size at max   256KB    4096KB   512KB    256KB    64KB     96KB     64KB   128KB

Hardware:

1. Dell PE850: single P4D (3.0 GHz dual-core) and Dell PE1750*: dual Xeon (2.8 GHz)
2. OS: CentOS 4.2 (Linux 2.6.9 and Linux 2.6.12-gamma) and NPACI Rocks (Linux 2.4)*
3. Switches: crossover cable, Extreme Networks Summit400-t48, Cisco chassis switch (model 6509)*

Network card         Driver                  MTU   MPI                Switch
Broadcom 5721 (A)    tg3-3.43                1500  LAM-7.1            400-t48
Broadcom 5721 (B)    GAMMA-06-08-08          1500  MPIGAMMA-06-07-17  400-t48
Intel 82545GM (A)    e1000-6.0.54-k2-NAPI    4120  LAM-7.1            Crossover
Intel 82545GM (B)    e1000 (ITR=0, TID=128)  4120  LAM-7.1            Crossover
Intel 82545GM (C)    GAMMA-06-02-17          4120  MPIGAMMA-06-02-09  Crossover
Intel 82545GM (D)    GAMMA-06-02-17          4120  MPIGAMMA-06-02-09  400-t48
Level 5 EF1-21022T   Proprietary             N/A   Proprietary        Crossover
Ammasso 1100         Proprietary             N/A   Proprietary        Cisco 6509

Some observations:

1. The Broadcom 5721 (B'Com A) performs well when used with a recent tg3 driver (from Broadcom); the 3.10 driver in the 2.6.9 kernel is not as good. The Intel PRO 1000 (Intel A) has a high bandwidth, but the large TCP driver latency reduces the performance for small messages.

2. The TCP performance of the Intel NIC (82545GM) can be improved by tuning the driver parameters. In case B we used InterruptThrottleRate=0 and TxIntDelay=128. The ITR=0 setting reduces the latency by a factor of 2; TID=128 allows for reduced CPU load with no measurable effect on performance. The effects of driver tuning may be more noticeable in practice than these limited results suggest. We notice that the throughput of the PRO 1000 with default TCP driver settings is quite erratic, and these dropouts may substantially reduce performance with large numbers of nodes.

3. Tuning the TCP buffer size brings increased performance. Generally, larger TCP buffers increase performance for large messages at the cost of a small performance penalty for smaller messages. We used 1 MByte buffers, which we found to be a good compromise; a minimal socket-buffer sketch is given after this list. Increasing the frame size (MTU) has a similar effect, with the added benefit of a reduction in CPU usage; we used MTU=4120 (4096+24) in these tests.


4. MPIGAMMA is a port of MPI to the GAMMA Ethernet driver; it reduces the latency of the default Intel adapter by an order of magnitude, better than the expensive proprietary interfaces from Level 5 and Ammasso. It is only compatible with the 2.6.12 and 2.6.18 kernels at present. The hardware latencies of the Broadcom and Intel NICs wired back to back, as measured by GAMMA, are 14μs and 6.5μs respectively. There is an additional latency of 2.3μs from the MPIGAMMA software layer and 3.3μs from the Summit400-t48 switch. The Broadcom NIC failed to reach its asymptotic bandwidth because we did not adjust the credit limit to account for the smaller packet size (1500 bytes); the Intel NICs used the optimum 4120-byte frame size.

5. With a low-latency network card the switch latency becomes critical, and the Summit400-t48 performs very well in this regard: the additional latency from the switch is only 3.3μs.

6. The switch reduces the maximum throughput under MPIGAMMA because the added latency requires a larger message size to saturate. GAMMA was configured to accept up to 32 4KB packets before sending an acknowledgement packet, and MPIGAMMA therefore takes a small (<10%) performance hit around 128 KBytes; slightly better performance can be obtained at the cost of more memory allocated to GAMMA. In bidirectional mode MPI typically reaches a peak performance for messages of around 64 KBytes, when it switches from "eager" to "rendezvous" protocols. LAM has a user interface (SSI) for choosing the transition message size; our results with LAM used the eager protocol throughout, which was found to deliver the maximum throughput.

7. The overall performance of a GAMMA-enabled Intel PRO NIC is remarkable; real-world applications can pass quite small bidirectional messages (32 KBytes) at rates in excess of 100 MB/s each way.

8. The proprietary RDMA cards from Level 5 and Ammasso (retail price around $500) used to outperform TCP/IP-based NICs, but with newer TCP drivers the Intel NIC outperforms both proprietary cards, except the L5 at small packet sizes. The Intel+MPIGAMMA combination is faster than both proprietary cards.

9. The Ammasso and Level 5 NICs substantially reduce CPU usage, which can increase performance if computation and communications are overlapped; GAMMA requires 100% CPU utilization and cannot profit from overlapping. The TCP driver (e1000-6.0.54-k2-NAPI) uses about 30% of the CPU; the tuned driver uses 40-50% CPU, depending on TID and MTU.

10. The Ammasso cards were tested on older hardware (2.8 GHz Xeons) with a 2.4 kernel; the other tests were with the 3 GHz P4D and the 2.6 kernel.
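As an illustration of the socket-buffer tuning in observation 3, the sketch below requests 1 MByte send and receive buffers on a TCP socket with setsockopt(). This only demonstrates the mechanism; it assumes the kernel maxima (net.core.rmem_max / net.core.wmem_max) permit buffers of that size, and it is not necessarily how the MPI layers above configure their sockets.

    /*
     * Illustration only: request 1 MByte TCP socket buffers, the size found
     * to be a good compromise in observation 3.
     */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        int bufsize = 1 << 20;            /* 1 MByte */
        socklen_t len = sizeof(bufsize);

        if (sock < 0) {
            perror("socket");
            return 1;
        }
        /* Request larger send and receive buffers before connecting. */
        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_RCVBUF");

        /* Report what the kernel actually granted (Linux typically doubles it). */
        getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, &len);
        printf("effective send buffer: %d bytes\n", bufsize);

        close(sock);
        return 0;
    }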

Sources and documentation

LAM v7.1: http://www.lam-mpi.org
MPI/GAMMA: http://www.disi.unige.it/project/gamma/mpigamma

Acknowledgements: We thank Giuseppe Ciaccio for extensive help in setting up our MPIGAMMA installation and also for eliminating a number of obscure bugs. We thank the University of Florida High-Performance Computer Center (http://www.hpc.ufl.edu) for access to the Ammasso and Cisco hardware.
