http://grid.ucl.ac.uk/nfnn.html
20th-21st June 2005, NeSC Edinburgh

End-user systems: NICs, Motherboards, Disks, TCP Stacks & Applications
Richard Hughes-Jones

The work reported here is from many networking collaborations.
Slide 2: Network Performance Issues

End system issues:
- Network Interface Card, its driver, and their configuration
- Processor speed
- Motherboard configuration, bus speed and capability
- Disk system
- TCP and its configuration
- Operating system and its configuration

Network infrastructure issues:
- Obsolete network equipment
- Configured bandwidth restrictions
- Topology
- Security restrictions (e.g., firewalls)
- Sub-optimal routing
- Transport protocols

Network capacity and the influence of others!
- Congestion on group, campus and access links
- Many, many TCP connections
- Mice and elephants on the path
Slide 3: Methodology used in testing NICs & Motherboards
Slide 4: Latency Measurements

- UDP/IP packets are sent between back-to-back systems. They are processed in a similar manner to TCP/IP but are not subject to the flow-control & congestion-avoidance algorithms. Used the UDPmon test program.
- Latency: round-trip times are measured using request-response UDP frames, and latency is studied as a function of frame size. A sketch of such a probe follows below.
- The slope is the sum of the time-per-byte costs along the data path: mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s), i.e.

  $s = \sum_{\text{data paths}} \frac{dt}{db}$

- The intercept indicates processing times + HW latencies.
- Histograms of 'singleton' measurements tell us about the behaviour of the IP stack, the way the HW operates, and interrupt coalescence.
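For readers who want to reproduce the idea, here is a minimal sketch of such a request-response probe in Python (a sketch, not UDPmon itself; the address, port and the existence of a UDP echo responder there are assumptions):

```python
#!/usr/bin/env python3
import socket, statistics, time

def rtt_probe(host, port, size, count=1000):
    """Send `count` UDP frames of `size` bytes and time the echoed reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    payload = b"\x00" * size
    samples = []
    for _ in range(count):
        t0 = time.perf_counter()
        sock.sendto(payload, (host, port))
        try:
            sock.recvfrom(65536)          # wait for the response frame
        except socket.timeout:
            continue                      # lost frame: no latency sample
        samples.append((time.perf_counter() - t0) * 1e6)  # microseconds
    return samples

if __name__ == "__main__":
    # Address and port are placeholders for a UDP echo responder.
    for size in (64, 512, 1024, 1400):
        s = rtt_probe("192.168.0.2", 14196, size, count=200)
        if s:
            print(f"{size:5d} bytes: median {statistics.median(s):7.1f} us")
```

Sweeping the frame size and fitting latency against length yields the slope and intercept discussed above.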
Slide 5: Throughput Measurements

UDP throughput: send a controlled stream of UDP frames (n bytes each) spaced at regular intervals, and measure the time to send, the time to receive, the inter-packet time (histogram) and the 1-way delay. The sender-receiver exchange is:

1. Zero stats ("OK done")
2. Send data frames at regular intervals (a set number of packets, with a set wait time)
3. Signal end of test ("OK done")
4. Get remote statistics; the receiver sends back: no. received, no. lost + loss pattern, no. out-of-order, CPU load & no. of interrupts, 1-way delay

A sketch of the paced sender follows below.
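A minimal sketch of the transmit side of such a test (illustrative Python, not the real UDPmon; the receiver address is an assumption, and the real tool also gathers the receiver-side statistics listed above):

```python
#!/usr/bin/env python3
import socket, time

def send_stream(host, port, size=1472, count=10000, spacing_us=15.0):
    """Send `count` UDP frames of `size` bytes at fixed `spacing_us` intervals."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = bytearray(size)
    interval = spacing_us * 1e-6
    t0 = time.perf_counter()
    t_next = t0
    for seq in range(count):
        payload[0:4] = seq.to_bytes(4, "big")  # sequence number for loss checks
        sock.sendto(payload, (host, port))
        t_next += interval
        while time.perf_counter() < t_next:    # busy-wait: sleep() is too coarse
            pass
    elapsed = time.perf_counter() - t0
    print(f"sent {count} frames, user-data rate ~ "
          f"{count * size * 8 / elapsed / 1e6:.0f} Mbit/s")

if __name__ == "__main__":
    send_stream("192.168.0.2", 14196)          # example receiver address
```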
Slide 6: PCI Bus & Gigabit Ethernet Activity

PCI activity measured with a logic analyser:
- PCI probe cards in the sending PC
- Gigabit Ethernet fibre probe card
- PCI probe cards in the receiving PC

In each PC the data path runs CPU - memory - chipset - PCI bus - NIC, so the possible bottlenecks (the two PCI buses, the NICs and the Gigabit Ethernet link itself) are all visible on the logic-analyser display.
Slide 7: "Server Quality" Motherboards

SuperMicro P4DP8-2G (P4DP6):
- Dual Xeon, 400/533 MHz front-side bus
- 6 PCI/PCI-X slots on 4 independent PCI buses: 64 bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
- Dual Gigabit Ethernet
- Adaptec AIC-7899W dual-channel SCSI
- UDMA/100 bus-master EIDE channels: data transfer rates of 100 MB/s burst
Slide 8: "Server Quality" Motherboards

Boston/Supermicro H8DAR:
- Two Dual Core Opterons
- 200 MHz DDR memory, theoretical BW 6.4 Gbit/s
- HyperTransport
- 2 independent PCI buses: 133 MHz PCI-X
- 2 Gigabit Ethernet
- SATA
- (PCI-e)
Slide 9: NIC & Motherboard Evaluations
Slide 10: SuperMicro 370DLE: Latency: SysKonnect

Setup - Motherboard: SuperMicro 370DLE; Chipset: ServerWorks III LE; CPU: PIII 800 MHz; RedHat 7.1, kernel 2.4.14.

PCI: 32 bit, 33 MHz
- Latency small (62 µs) & well behaved
- Latency slope 0.0286 µs/byte
- Expect 0.0232 µs/byte: PCI 0.00758 + GigE 0.008 + PCI 0.00758

PCI: 64 bit, 66 MHz
- Latency small (56 µs) & well behaved
- Latency slope 0.0231 µs/byte
- Expect 0.0118 µs/byte: PCI 0.00188 + GigE 0.008 + PCI 0.00188
- Possible extra data moves?

[Plots: latency (µs) vs message length (0-3000 bytes). SysKonnect 32 bit 33 MHz fits: y = 0.0286x + 61.79 and y = 0.0188x + 89.756. SysKonnect 64 bit 66 MHz fits: y = 0.0231x + 56.088 and y = 0.0142x + 81.975.]
Slide 11: SuperMicro 370DLE: Throughput: SysKonnect

Setup as slide 10 (370DLE, ServerWorks III LE, PIII 800 MHz, RedHat 7.1, kernel 2.4.14).

PCI: 32 bit, 33 MHz
- Max throughput 584 Mbit/s
- No packet loss for >18 µs spacing

PCI: 64 bit, 66 MHz
- Max throughput 720 Mbit/s
- No packet loss for >17 µs spacing
- Packet loss during the throughput drop
- 95-100% kernel mode on the receiver

[Plots: received wire rate (Mbit/s) and receiver % CPU kernel mode vs transmit time per frame (0-40 µs), for frame sizes 50-1472 bytes, at 32 bit 33 MHz and 64 bit 66 MHz.]
Slide 12: SuperMicro 370DLE: PCI: SysKonnect

Setup: 370DLE, ServerWorks III LE chipset, PIII 800 MHz, PCI 64 bit 66 MHz, RedHat 7.1, kernel 2.4.14.

Logic-analyser trace of one 1400-byte frame (100 µs wait between sends), showing the send setup, the send PCI transfer, the packet on the Ethernet fibre, the receive PCI transfer and the receive transfer to memory:
- ~8 µs on the PCI bus for a send or a receive
- Stack & application overhead ~10 µs per node
- ~36 µs in total on the trace
Slide 13: SuperMicro 370DLE: PCI: Intel Pro/1000

Signals on the PCI bus, 1472-byte packets every 15 µs; the traces show the send setup, the send PCI data transfers and the receive PCI transfers at each bus speed:
- PCI 64 bit, 33 MHz: 82% bus usage
- PCI 64 bit, 66 MHz: 65% bus usage; the data transfers take half as long
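A rough sanity check on the "half as long" observation (ignoring arbitration, setup and CSR cycles, so these are lower bounds): the burst time for one 1472-byte frame on a 64-bit (8-byte) wide bus is

$$t_{33} \approx \frac{1472}{8 \times 33{\times}10^{6}} \approx 5.6\ \mu\mathrm{s}, \qquad t_{66} \approx \frac{1472}{8 \times 66{\times}10^{6}} \approx 2.8\ \mu\mathrm{s}.$$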
Slide 14: SuperMicro 370DLE: PCI: SysKonnect

Setup: 370DLE, ServerWorks III LE chipset, PIII 800 MHz, PCI 64 bit 66 MHz, RedHat 7.1, kernel 2.4.14.

- 1400 bytes sent, wait 20 µs: frames appear on the Ethernet fibre at 20 µs spacing.
- 1400 bytes sent, wait 10 µs: frames are back-to-back; the 800 MHz CPU can drive at line speed. Cannot go any faster!
Slide 15: SuperMicro 370DLE: Throughput: Intel Pro/1000

Setup: 370DLE, ServerWorks III LE chipset, PIII 800 MHz, PCI 64 bit 66 MHz, RedHat 7.1, kernel 2.4.14; Intel Pro/1000, 64 bit 66 MHz.

- Max throughput 910 Mbit/s; no packet loss for >12 µs spacing
- Packet loss during the throughput drop; CPU load 65-90% for spacing <13 µs

[Plots: received wire rate (Mbit/s) and % packet loss vs transmit time per frame (0-40 µs), for frame sizes 50-1472 bytes.]
Slide 16: SuperMicro 370DLE: PCI: Intel Pro/1000

Setup: 370DLE, ServerWorks III LE chipset, PIII 800 MHz, PCI 64 bit 66 MHz, RedHat 7.1, kernel 2.4.14.

Request-response trace: a 64-byte request is sent; the trace shows the send PCI transfer, the interrupt delay before the send-interrupt processing, the request received at the far end, the receive-interrupt processing, and the 1400-byte response.
- Demonstrates interrupt coalescence: there is no processing directly after each transfer
Slide 17: SuperMicro P4DP6: Latency: Intel Pro/1000

Setup - Motherboard: SuperMicro P4DP6; Chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64 bit 66 MHz; RedHat 7.2, kernel 2.4.19.

- Latency vs message length shows some steps: overall slope 0.009 µs/byte, slope of the flat sections 0.0146 µs/byte; expect 0.0118 µs/byte
- Histograms: no variation with packet size, FWHM 1.5 µs; confirms the timing is reliable

[Plots: latency (µs) vs message length (0-3000 bytes) with fits y = 0.0093x + 194.67 and y = 0.0149x + 201.75; latency histograms N(t) for 64, 512, 1024 and 1400-byte packets.]
Slide 18: SuperMicro P4DP6: Throughput: Intel Pro/1000

Setup as slide 17 (P4DP6, Intel E7500 Plumas, dual Xeon Prestonia 2.2 GHz, PCI 64 bit 66 MHz, RedHat 7.2, kernel 2.4.19).

- Max throughput 950 Mbit/s; no packet loss
- Averages are misleading: CPU utilisation on the receiving PC was ~25% for packets larger than 1000 bytes, 30-40% for smaller packets

[Plots (gig6-7, Intel PCI 66 MHz, 27 Nov 02): received wire rate (Mbit/s) and % CPU idle on the receiver vs transmit time per frame (0-40 µs), for frame sizes 50-1472 bytes.]
Slide 19: SuperMicro P4DP6: PCI: Intel Pro/1000

Setup as slide 17.

- 1400 bytes sent, wait 12 µs: ~5.14 µs on the send PCI bus (~68% bus occupancy); ~3 µs on the PCI bus for the data receive
- CSR access inserts PCI STOPs: the NIC takes ~1 µs per CSR access - the CPU is faster than the NIC!
- A similar effect is seen with the SysKonnect NIC
Slide 20: SuperMicro P4DP8-G2: Throughput: Intel onboard

Setup - Motherboard: SuperMicro P4DP8-G2; Chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.4 GHz; PCI-X 64 bit; RedHat 7.3, kernel 2.4.19; Intel onboard NIC.

- Max throughput 995 Mbit/s; no packet loss
- Averages are misleading: 20% CPU utilisation on the receiver for packets larger than 1000 bytes, 30% for smaller packets

[Plots: received wire rate (Mbit/s) and % CPU idle on the receiver vs transmit time per frame (0-40 µs), for frame sizes 50-1472 bytes.]
Slide 21: Interrupt Coalescence: Throughput

Intel Pro/1000 on the 370DLE, with interrupt coalescence settings coa5 to coa100.

[Plots: received wire rate (Mbit/s) vs delay between transmit packets (0-40 µs), for 1472-byte and 1000-byte packets, one curve per coalescence setting: coa5, coa10, coa20, coa40, coa64, coa100.]
Slide 22: Interrupt Coalescence Investigations

Set kernel parameters for socket buffer size = rtt * BW (the bandwidth-delay product). TCP mem-mem transfers lon2-man1:

- Tx 64, Tx-abs 64, Rx 0,  Rx-abs 128: 820-980 Mbit/s +- 50 Mbit/s
- Tx 64, Tx-abs 64, Rx 20, Rx-abs 128: 937-940 Mbit/s +- 1.5 Mbit/s
- Tx 64, Tx-abs 64, Rx 80, Rx-abs 128: 937-939 Mbit/s +- 1 Mbit/s

A buffer-sizing sketch is shown below.
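As an illustration of the rule of thumb above (socket buffer = rtt * BW), a minimal Python sketch; the RTT and rate are example values, not measurements from this talk:

```python
import socket

# Rule of thumb: socket buffer = rtt * bandwidth (bandwidth-delay product).
rtt_s = 0.0062                          # 6.2 ms, e.g. an MB-NG-like path
rate_bps = 1_000_000_000                # 1 Gbit/s
bdp_bytes = int(rtt_s * rate_bps / 8)   # ~775 kbytes

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)

# Note: on Linux the kernel caps these requests at net.core.wmem_max /
# net.core.rmem_max, so those sysctls must be raised first for large BDPs.
print(f"requested {bdp_bytes} byte socket buffers")
```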
Slide 23: Supermarket Motherboards and other "Challenges"
Slide 24: Tyan Tiger S2466N

Setup - Motherboard: Tyan Tiger S2466N; PCI: one 64 bit 66 MHz; CPU: Athlon MP2000+; Chipset: AMD-760 MPX.

- The 3Ware controller forces the PCI bus down to 33 MHz
- Tyan to MB-NG SuperMicro network: mem-mem 619 Mbit/s
Slide 25: IBM das: Throughput: Intel Pro/1000

Setup - Motherboard: IBM das; Chipset: ServerWorks CNB20LE; CPU: dual PIII 1 GHz; PCI 64 bit 33 MHz; RedHat 7.1, kernel 2.4.14.

- Max throughput 930 Mbit/s; no packet loss for >12 µs spacing; clean behaviour
- Packet loss during the throughput drop
- PCI trace, 1400 bytes sent at 11 µs spacing: signals clean; ~9.3 µs on the send PCI bus (~82% bus occupancy); ~5.9 µs on the PCI bus for the data receive

[Plot: received wire rate (Mbit/s) vs transmit time per frame (0-40 µs), for frame sizes 50-1472 bytes, Intel Pro/1000 on IBM das, 64 bit 33 MHz.]
Slide 26: Network switch limits behaviour

End-to-end UDP packets from udpmon:
- Only 700 Mbit/s throughput
- Lots of packet loss
- The packet-loss distribution shows that the throughput is limited

[Plots (w05gva-gig6, 29 May 04): received wire rate (Mbit/s) and % packet loss vs spacing between frames (0-40 µs), for frame sizes 50-1472 bytes; 1-way delay (µs) vs packet number at 12 µs wait, full trace and zoom on packets 500-550.]
Slide 27: 10 Gigabit Ethernet
Slide 28: 10 Gigabit Ethernet: UDP Throughput

- A 1500-byte MTU gives ~2 Gbit/s; used a 16144-byte MTU, max user length 16080 bytes
- DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes: wire-rate throughput of 2.9 Gbit/s
- CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64-bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes: wire rate of 5.7 Gbit/s
- SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes: wire rate of 5.4 Gbit/s

[Plot (an-al 10GE Xsum 512kbuf MTU16114, 27 Oct 03): received wire rate (Mbit/s, 0-6000) vs spacing between frames (0-40 µs), for packet sizes 1472-16080 bytes.]
Slide 29: 10 Gigabit Ethernet: Tuning PCI-X

16080-byte packets every 200 µs; Intel PRO/10GbE LR adapter; PCI-X bus occupancy vs mmrbc (max memory read byte count). The logic-analyser traces show the PCI-X sequences: CSR access, data transfer, interrupt & CSR update.
- Measured times are compared with times computed from the PCI-X transfers seen on the logic analyser
- Larger mmrbc shortens the transfer: traces shown for mmrbc of 512, 1024, 2048 and 4096 bytes
- Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s with mmrbc 4096 bytes

[Plots: PCI-X transfer time (µs) vs max memory read byte count (0-5000) - measured rate (Gbit/s), rate from expected time (Gbit/s), and max PCI-X throughput - for the HP Itanium (kernel 2.6.1#17, Feb 04) and DataTAG Xeon 2.2 GHz hosts.]
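One way to see why mmrbc matters (an illustrative model, not taken from the slides): each PCI-X read sequence moves at most mmrbc bytes but pays a roughly fixed setup overhead, so the achievable rate on a 64-bit, 133 MHz bus is approximately

$$R \;\approx\; R_{\mathrm{raw}} \cdot \frac{\mathrm{mmrbc}}{\mathrm{mmrbc} + c}, \qquad R_{\mathrm{raw}} = 64\ \mathrm{bit} \times 133\ \mathrm{MHz} \approx 8.5\ \mathrm{Gbit/s},$$

where c is the per-sequence overhead expressed in equivalent bytes. Raising mmrbc from 512 to 4096 bytes shrinks the overhead fraction, which is the trend in the measured curves.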
Slide 30: Different TCP Stacks
Slide 31: Investigation of new TCP Stacks

The AIMD algorithm - standard TCP (Reno):
- For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (additive increase, a = 1)
- For each window experiencing loss: cwnd -> cwnd - b*cwnd (multiplicative decrease, b = 1/2)

High Speed TCP: a and b vary with the current cwnd, using a table.
- a increases more rapidly with larger cwnd, so the sender returns to the 'optimal' cwnd for the network path sooner
- b decreases less aggressively, so the cwnd, and with it the throughput, falls less after a loss

Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd.
- a = 1/100: the increase is greater than TCP Reno's
- b = 1/8: the decrease on loss is less than TCP Reno's
- Scalable over any link speed

Fast TCP: uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput.

The control laws are sketched in code after this list.
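To make the difference concrete, a toy Python sketch of these control laws (per-ACK increase, per-loss decrease, windows in segments; the a and b values are those quoted above; real HighSpeed TCP takes a(cwnd) and b(cwnd) from a table, so only Reno and Scalable are shown):

```python
def reno_on_ack(cwnd, a=1.0):
    return cwnd + a / cwnd         # ~ +a segments per RTT in total

def reno_on_loss(cwnd, b=0.5):
    return cwnd * (1.0 - b)        # halve the window

def scalable_on_ack(cwnd, a=0.01):
    return cwnd + a                # +1/100 segment per ACK: grows with cwnd

def scalable_on_loss(cwnd, b=0.125):
    return cwnd * (1.0 - b)        # lose only 1/8 of the window

# Example: recovery after one loss at cwnd = 1000 segments.
for name, on_ack, on_loss in [("reno", reno_on_ack, reno_on_loss),
                              ("scalable", scalable_on_ack, scalable_on_loss)]:
    cwnd, target, rtts = on_loss(1000.0), 1000.0, 0
    while cwnd < target:
        for _ in range(int(cwnd)):          # one RTT's worth of ACKs
            cwnd = on_ack(cwnd)
        rtts += 1
    print(f"{name}: back to {target:.0f} segments after {rtts} RTTs")
```

Running this shows Reno needing ~500 RTTs to regain the window while Scalable needs ~14, which is the behaviour the measurements on the following slides illustrate.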
Slide 32: Comparison of TCP Stacks

TCP response function: throughput vs loss rate - the further a curve sits to the right, the higher the loss rate at which the stack still recovers quickly. Packets were dropped in the kernel. Tested on MB-NG (rtt 6 ms) and DataTAG (rtt 120 ms).
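For context (the standard result, not a number from the slides): Reno's response function ties the achievable throughput T to the MSS, the RTT and the packet-loss probability p, roughly

$$T \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}} \sqrt{\frac{3}{2p}},$$

so standard TCP needs a very low loss rate to hold a gigabit on a long path; the new stacks shift this curve to the right.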
Slide 33: High Throughput Demonstrations

Topology: man03, a dual Xeon 2.2 GHz PC at Manchester (Geneva), connects by 1 GEth through a Cisco 7609 and a Cisco GSR onto the 2.5 Gbit SDH MB-NG core, and out through a Cisco GSR and a Cisco 7609 by 1 GEth to lon01, a dual Xeon 2.2 GHz PC at London (Chicago).
- Send data with TCP; drop packets; monitor TCP with Web100
Slide 34: High Performance TCP - MB-NG

Drop 1 in 25,000 packets, rtt 6.2 ms: recovery in 1.6 s. Traces compared for Standard, HighSpeed and Scalable TCP.
Slide 35: High Performance TCP - DataTAG

Different TCP stacks tested on the DataTAG network, rtt 128 ms, dropping 1 packet in 10^6:
- HighSpeed: rapid recovery
- Scalable: very fast recovery
- Standard: recovery would take ~20 mins
Slide 36: Disk and RAID Sub-Systems

Based on a talk at NFNN 2004 by Andrew Sansum, Tier 1 Manager at RAL.
Slide 37: Reliability and Operation

- The performance of a system that is broken is 0.0 MB/s! While staff are fixing broken servers they cannot spend their time optimising performance, and with a lot of systems, anything that can break, will break.
- Typical ATA failure rate is 2-3% per annum at RAL, CERN and CASPUR. That is 30-45 drives per annum on the RAL Tier1, i.e. one per week. One failure for every 10^15 bits read; MTBF 400K hours.
- RAID arrays may not handle block re-maps satisfactorily, or take so long that the system has given up. RAID5 is not perfect protection: there is a finite chance of 2 disks failing!
- SCSI interconnects corrupt data or give CRC errors. Filesystems (ext2) corrupt for no apparent cause (EXT2 errors). Silent data corruption happens, so checksums are vital.
- Fixing data corruption can take a huge amount of time (>1 week), and data loss is possible.
Slide 38: You need to Benchmark - but how?

- Andrew's golden rule: Benchmark, Benchmark, Benchmark.
- Choose a benchmark that matches your application (we use Bonnie++ and IOZONE): single stream of sequential I/O; random disk accesses; multi-stream (30 threads is more typical).
- Watch out for caching effects: use large files and small memory (a minimal sketch of this point follows below).
- Use a standard protocol that gives reproducible results, and document what you did for future reference.
- Stick with the same benchmark suite/protocol as long as possible to gain familiarity: it is much easier to spot significant changes.
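As a tiny illustration of the caching trap (a sketch, not Bonnie++ or IOZONE; the test-file path is an assumption): a sequential-read timer only measures the disk if the file is much larger than RAM and was not just written.

```python
#!/usr/bin/env python3
import time

def sequential_read_mb_s(path, block=64 * 1024):
    """Time one sequential pass over `path` and return MB/s."""
    total = 0
    t0 = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            total += len(chunk)
    return total / (time.perf_counter() - t0) / 1e6

if __name__ == "__main__":
    # Make this file several times larger than RAM, otherwise you are
    # benchmarking the page cache, not the disk.
    print(f"{sequential_read_mb_s('/data/testfile'):.1f} MB/s")
```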
Slide 39: Hard Disk Performance

Factors affecting performance: seek time (rotation speed is a big component), transfer rate, cache size, retry count (vibration).

Watch out for unusual disk behaviours: write/verify (the drive verifies writes for the first <n> writes), drive spin-down, etc.

StorageReview is often worth reading: http://www.storagereview.com/
Slide 40: Impact of Drive Transfer Rate

A slower drive may not cause significant sequential performance degradation in a RAID array: this 5400 rpm drive was 25% slower than the 7200 rpm one.

[Plot: read rate (MB/s, 0-80) vs number of threads (1-32), for 7200 rpm and 5400 rpm drives.]
Slide 41: Disk Parameters

From: www.storagereview.com

- Distribution of seek time: mean 14.1 ms against a published seek time of 9 ms (with rotational latency taken off).
- Head position & transfer rate: the outside of the disk has more blocks, so it is faster - 58 MB/s falling to 35 MB/s across the disk, a 40% effect.
Slide 42: Drive Interfaces

http://www.maxtor.com/
Slide 43: RAID Levels

Choose an appropriate RAID level:
- RAID 0 (stripe) is fastest but has no redundancy
- RAID 1 (mirror): single-disk performance
- RAID 2, 3 and 4: not very common or well supported; sometimes worth testing with your application
- RAID 5: read is almost as fast as RAID 0; write is substantially slower
- RAID 6: extra parity info - good for unreliable drives but can have a considerable impact on performance; rare but potentially very useful for large IDE disk arrays
- RAID 50 & RAID 10: RAID 0 across 2 (or more) controllers
Slide 44: Kernel Versions

- Kernel versions can make an enormous difference; it is not unusual for I/O to be badly broken in Linux.
- The 2.4 series had Virtual Memory subsystem problems: only 2.4.5, some 2.4.9, and >2.4.17 are OK.
- Red Hat Enterprise 3 seems badly broken (although the most recent RHEL kernel may be better).
- The 2.6 kernel contains new functionality and may be better (although first tests show little difference).
- Always check after an upgrade that performance remains good.
Slide 45: Kernel Tuning

Tunable kernel parameters can improve I/O performance, but it depends what you are trying to achieve.
- IRQ balancing: there are issues with IRQ handling on some Xeon chipset/kernel combinations (all IRQs handled on CPU zero); Hyperthreading.
- I/O schedulers: can tune for latency by minimising the seeks. /sbin/elvtune sets the number of sectors of writes before reads are allowed; the choice of scheduler (elevator=xx) orders I/O to minimise seeks and merges adjacent requests.
- VM tuning on 2.4: bdflush and readahead (a single thread helped):
  sysctl -w vm.max-readahead=512
  sysctl -w vm.min-readahead=512
  sysctl -w vm.bdflush="10 500 0 0 3000 10 20 0"
- On 2.6 the readahead command is blockdev (default 256):
  blockdev --setra 16384 /dev/sda
Slide 46: Example Effect of VM Tuneup

Redhat 7.3 read performance, default vs tuned VM settings.

[Plot: read rate (MB/s, 0-200) vs number of threads (1-32), default and tuned.]
Slide 47: IOZONE Tests

- Read performance, ext2 vs ext3 file system: a 40% effect due to journal behaviour.
- ext3 write and read compared for the three journal modes: writeback, ordered and journal (journal mode writes the data twice: data + journal).

[Plots: read rate (MB/s, 0-70) vs threads (1-32) for ext2 and ext3; ext3 write (MB/s, 0-70) and ext3 read (MB/s, 0-60) vs threads for writeback, ordered and journal modes.]
Slide 48: RAID Controller Performance

RAID5 (striped with redundancy) tested for: 3Ware 7506 parallel 66 MHz, 3Ware 7505 parallel 33 MHz, 3Ware 8506 serial ATA 66 MHz, ICP serial ATA 33/66 MHz.
- Test system: dual 2.2 GHz Xeon on a Supermicro P4DP8-G2 motherboard; disks: Maxtor 160 GB 7200 rpm, 8 MB cache
- Read-ahead kernel tuning: /proc/sys/vm/max-readahead = 512
- Rates for the same PC with RAID0 (striped): read 1040 Mbit/s, write 800 Mbit/s
- Plots: disk-memory read speeds and memory-disk write speeds (Stephen Dallison)
Slide 49: SC2004 RAID Controller Performance

- Supermicro X5DPE-G2 motherboards loaned from Boston Ltd.; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
- 3Ware 8506-8 controller on a 133 MHz PCI-X bus, configured as RAID0 with a 64 kbyte stripe size
- Six 74.3 GByte Western Digital Raptor WD740 SATA disks: 75 Mbyte/s disk-buffer, 150 Mbyte/s buffer-memory
- Scientific Linux with 2.6.6 kernel + altAIMD patch (Yee) + packet loss patch
- Read-ahead kernel tuning: /sbin/blockdev --setra 16384 /dev/sda
- RAID0 (striped), 2 GByte file: read 1500 Mbit/s, write 1725 Mbit/s

[Plots (RAID0, 6 disks, 3w8506-8): disk-memory read and memory-disk write throughput (Mbit/s) vs buffer size (kbytes, 0-200), for 0.5, 1 and 2 GByte files.]
Slide 50: Applications: Throughput for Real Users
Slide 51: Topology of the MB-NG Network

Key: Gigabit Ethernet access; 2.5 Gbit POS core; MPLS; admin. domains.

Three domains, each behind Cisco 7609 edge/boundary routers, joined across the UKERNA development network:
- Manchester domain: man01, man02, man03
- UCL domain: lon01, lon02, lon03
- RAL domain: ral01, ral02, with HW RAID attached
Slide 52: Gridftp Throughput + Web100

- RAID0 disks: 960 Mbit/s read, 800 Mbit/s write
- Throughput (Mbit/s) alternates between 600/800 Mbit/s and zero; data rate 520 Mbit/s
- Cwnd is smooth; no duplicate ACKs / send stalls / timeouts
Slide 53: HTTP data transfers, HighSpeed TCP

- Same hardware; bulk data moved by web servers: Apache web server out of the box! Prototype client built on the curl http library (a stand-in sketch follows below)
- 1 Mbyte TCP buffers; 2 GByte file; throughput ~720 Mbit/s
- Cwnd shows some variation; no duplicate ACKs / send stalls / timeouts
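A rough sketch of such a bulk-transfer HTTP client (the original prototype used the curl library in C; this stand-in uses Python's standard library instead, and the URL is a placeholder):

```python
#!/usr/bin/env python3
import time
import urllib.request

url = "http://server.example.org/2gbyte.file"   # placeholder URL
block = 1024 * 1024                             # 1 Mbyte application reads
total = 0
t0 = time.perf_counter()
with urllib.request.urlopen(url) as resp:
    while True:
        chunk = resp.read(block)
        if not chunk:
            break
        total += len(chunk)
elapsed = time.perf_counter() - t0
print(f"{total} bytes in {elapsed:.1f} s = {total * 8 / elapsed / 1e6:.0f} Mbit/s")
```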
Slide 54: bbftp: What else is going on? Scalable TCP

Traces for BaBar + SuperJANET and SuperMicro + SuperJANET:
- The congestion window shows duplicate-ACK events
- The variation is not TCP related: disk speed / bus transfer / the application?
Slide 55: SC2004 UKLIGHT Topology

Diagram elements: MB-NG 7600 OSR at Manchester; ULCC UKLight; UCL HEP and the UCL network; UKLight 10G to Chicago Starlight (four 1GE channels); Amsterdam via SURFnet/EuroLink 10G (two 1GE channels); at SC2004 the Caltech booth (UltraLight IP) and the SLAC booth behind a Cisco 6509; NLR lambda NLR-PITT-STAR-10GE-16; Caltech 7600.
Slide 56: SC2004 Disk-Disk bbftp

- The bbftp file transfer program uses TCP/IP
- UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
- Moved a 2 GByte file, disk-TCP-disk at 1 Gbit/s, with Web100 plots:
- Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
- Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, including ~4.5 s of overhead)

[Plots: instantaneous BW, average BW and CurCwnd vs time (0-20000 ms) for the two stacks.]
Slide 57: Network & Disk Interactions (work in progress)

Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size.

Measure memory to RAID0 transfer rates with & without UDP traffic:
- Disk write alone: 1735 Mbit/s
- Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
- Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%

[Plots (RAID0, 6 disks, 1 GByte writes, 3w8506-8): write throughput (Mbit/s, 0-2000) vs trial number for the plain, udp and udp9000 runs; % CPU system mode L1+2 vs L3+4 scatter plots for 8k and 64k blocks, with fits y = -1.017x + 178.32 and y = -1.0479x + 174.44 (~ y = 178 - 1.05x); % CPU kernel mode.]
Slide 58: Remote Processing Farms: Manc-CERN

Round-trip time 20 ms; 64-byte request (green), 1 Mbyte response (blue).
- TCP is in slow start: the 1st event takes 19 rtt, or ~380 ms
- The TCP congestion window gets re-set on each request: a TCP stack implementation detail to reduce Cwnd after inactivity
- Even after 10 s, each response takes 13 rtt, or ~260 ms
- Transfer achievable throughput: 120 Mbit/s

[Web100 plots: DataBytesOut and DataBytesIn (deltas) vs time; DataBytesOut with CurCwnd; TCP achievable throughput (Mbit/s) with Cwnd, vs time (0-2000 ms).]
Slide 59: Summary, Conclusions & Thanks

- The host is critical: motherboards, NICs, RAID controllers and disks matter.
- The NICs should be well designed, using 64 bit 133 MHz PCI-X (66 MHz PCI can be OK); NIC/drivers need efficient CSR access, clean buffer management and good interrupt handling.
- Worry about the CPU-memory bandwidth as well as the PCI bandwidth: data crosses the memory bus at least 3 times.
- Separate the data transfers: use motherboards with multiple 64 bit PCI-X buses.
- Test disk systems with a representative load; choose a modern high-throughput RAID controller; consider SW RAID0 of RAID5 HW controllers.
- Plenty of CPU power is needed for sustained 1 Gbit/s transfers and disk access.
- Packet loss is a killer: check on campus links & equipment, and on access links to backbones.
- The new stacks are stable and give better response & performance, but you still need to set the TCP buffer sizes & other kernel settings, e.g. window-scale.
- Application architecture & implementation is important.
- The interaction between HW, protocol processing and the disk sub-system is complex.
Slide 60: More Information - Some URLs

- UKLight web site: http://www.uklight.ac.uk
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
- Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/
- "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
- Real-time remote farm site: http://csr.phys.ualberta.ca/real-time
- Disk information: http://www.storagereview.com/
Slide 61: Any Questions?
Slide 62: Backup Slides
Slide 63: TCP (Reno) - Details

The time for TCP to recover its throughput from 1 lost packet on a link of capacity C is given by:

$$\tau = \frac{C \cdot \mathrm{RTT}^2}{2 \cdot \mathrm{MSS}}$$

For an rtt of ~200 ms this reaches tens of minutes at Gbit/s rates.
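As a worked example (assuming a 1 Gbit/s link, an rtt of 200 ms and a 1460-byte MSS, values chosen for illustration):

$$\tau = \frac{10^{9}\ \mathrm{bit/s} \times (0.2\ \mathrm{s})^2}{2 \times 1460 \times 8\ \mathrm{bit}} \approx 1.7 \times 10^{3}\ \mathrm{s} \approx 28\ \mathrm{min}.$$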
[Plot: time to recover (s, log scale 0.0001-100000) vs rtt (0-200 ms), for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links, with markers at UK 6 ms, Europe 20 ms, USA 150 ms.]
Slide 64: Packet Loss and new TCP Stacks

TCP response function measured on UKLight, London-Chicago-London, rtt 180 ms, 2.6.6 kernel. Agreement with theory is good.

[Plots (sculcc1-chi-2 iperf, 13 Jan 05): TCP achievable throughput (Mbit/s) vs packet drop rate (1 in n, 10^2-10^8), on log and linear scales, for A0 1500, A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A8 Westwood and A7 Vegas, plus the A0 and Scalable theory curves.]
Slide 65: Test of TCP Sharing: Methodology (1 Gbit/s)

- Chose 3 paths from SLAC (California): Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
- Used iperf/TCP and UDT/UDP to generate traffic through the TCP/UDP bottleneck, with 1/s ICMP/ping traffic alongside
- Each run was 16 minutes, in 7 regions (2 min / 4 min pattern)

Les Cottrell, PFLDnet 2005.
Slide 66: TCP Reno single stream

Low performance on fast long-distance paths: AIMD (add a = 1 packet to cwnd per RTT; decrease cwnd by factor b = 0.5 in congestion). The net effect is that TCP recovers slowly and does not effectively use the available bandwidth, so throughput is poor and sharing is unequal.

SLAC to CERN measurements (Les Cottrell, PFLDnet 2005):
- Congestion has a dramatic effect, and recovery is slow: hence the case for increasing the recovery rate
- RTT increases when the flow achieves its best throughput
- The remaining flows do not take up the slack when a flow is removed
Slide 67: Average Transfer Rates (Mbit/s)

App     | TCP Stack | SuperMicro on MB-NG | SuperMicro on SuperJANET4 | BaBar on SuperJANET4 | SC2004 on UKLight
--------|-----------|---------------------|---------------------------|----------------------|------------------
Iperf   | Standard  | 940                 | 350-370                   | 425                  | 940
Iperf   | HighSpeed | 940                 | 510                       | 570                  | 940
Iperf   | Scalable  | 940                 | 580-650                   | 605                  | 940
bbcp    | Standard  | 434                 | 290-310                   | 290                  |
bbcp    | HighSpeed | 435                 | 385                       | 360                  |
bbcp    | Scalable  | 432                 | 400-430                   | 380                  |
bbftp   | Standard  | 400-410             | 325                       | 320                  | 825
bbftp   | HighSpeed | 370-390             | 380                       |                      |
bbftp   | Scalable  | 430                 | 345-532                   | 380                  | 875
apache  | Standard  | 425                 | 260                       | 300-360              |
apache  | HighSpeed | 430                 | 370                       | 315                  |
apache  | Scalable  | 428                 | 400                       | 317                  |
Gridftp | Standard  | 405                 | 240                       |                      |
Gridftp | HighSpeed | 320                 |                           |                      |
Gridftp | Scalable  | 335                 |                           |                      |

Annotations: the new stacks give more throughput; the rate decreases down the application list.
Slide 68: iperf Throughput + Web100

- SuperMicro on the MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs <10 (expect ~400)
- BaBar on the production network, Standard TCP: 425 Mbit/s; DupACKs 350-400: re-transmits
Slide 69: Applications: Throughput Mbit/s

HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET: bbcp, bbftp, Apache, Gridftp. (Previous work used RAID0, so it was not disk limited.)
Slide 70: bbcp & GridFTP Throughput

- 2 GByte file transferred, RAID5 (4 disks), Manchester - RAL
- bbcp: mean 710 Mbit/s
- GridFTP: many zeros seen; mean rates ~710 and ~620 Mbit/s
- DataTAG altAIMD kernel in BaBar & ATLAS
Slide 71: tcpdump of the T/DAQ dataflow at SFI (1)

CERN-Manchester, 1.0 Mbyte event:
- The remote EFD requests an event from the SFI: incoming event request, followed by an ACK
- The SFI sends the event as N 1448-byte packets, limited by the TCP receive buffer; as TCP ACKs arrive, more data is sent
- Time: 115 ms (~4 events/s)