How deep is your buffer – Demystifying buffers and application performance


Transcript of How deep is your buffer – Demystifying buffers and application performance

Page 1: How deep is your buffer – Demystifying buffers and application performance


March 14, 2017

JR Rivers | Co-founder/CTO

A JOURNEY TO DEEPER UNDERSTANDING

Network Data Path

Page 2: How deep is your buffer – Demystifying buffers and application performance


How Much Buffer – the take away

- If the last bit of performance matters to you, do the testing; be careful of what you read.

- If not, take solace... the web-scales use "small buffer" switches.


Page 3: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs – Show and Tell


cumulus@server02:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Model name:            Intel(R) Xeon(R) CPU L5520 @ 2.27GHz
Stepping:              5
CPU MHz:               1600.000
CPU max MHz:           2268.0000
CPU min MHz:           1600.0000
BogoMIPS:              4441.84
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-15
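On a box like this, where the work lands matters: interrupt handling and the iperf3 process can end up on different cores or NUMA nodes and skew results (note the two busy cores in the top output on the next slide). A minimal sketch, not from the talk, of pinning the endpoints to a known core with taskset (core numbers are illustrative):

# Pin server and client to a specific core so run-to-run numbers are comparable
sudo taskset -c 4 iperf3 -s -p 5201          # on the receiver
taskset -c 4 iperf3 -c rack-edge01 -p 5201   # on the sender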

[Topology diagram: 25GE attached servers, 100G interconnect. server01/server02 sit under leaf01 and server03/server04 under leaf03 on 25G links; spine01 interconnects the leafs at 100G; exit01 and edge01 provide the path to the Internet; an oob-mgmt-server and oob-mgmt-switch handle out-of-band management. One link is marked "Link Under Test".]

Page 4: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs - iperf3


cumulus@server01:~$ iperf3 -c rack-edge01 -p 5201 -t 30
Connecting to host rack-edge01, port 5201
[  4] local 10.0.1.1 port 34912 connected to 10.0.3.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  2.13 GBytes  18.3 Gbits/sec  433    888 KBytes
[  4]   1.00-2.00   sec  2.74 GBytes  23.5 Gbits/sec    0    888 KBytes
[  4]   2.00-3.00   sec  2.74 GBytes  23.5 Gbits/sec    0   1020 KBytes
[  4]   3.00-4.00   sec  2.74 GBytes  23.5 Gbits/sec    0   1020 KBytes
[  4]   4.00-5.00   sec  2.74 GBytes  23.5 Gbits/sec    0   1.01 MBytes
[  4]   5.00-6.00   sec  2.74 GBytes  23.5 Gbits/sec    0   1.02 MBytes
[  4]   6.00-7.00   sec  2.72 GBytes  23.4 Gbits/sec    0   1.16 MBytes
[  4]   7.00-8.00   sec  2.72 GBytes  23.4 Gbits/sec    0   1.45 MBytes
[  4]   8.00-9.00   sec  2.74 GBytes  23.5 Gbits/sec    0   1.46 MBytes
[  4]   9.00-10.00  sec  2.74 GBytes  23.5 Gbits/sec    0   1.46 MBytes
[  4]  10.00-11.00  sec  2.74 GBytes  23.5 Gbits/sec    0   1.46 MBytes
[  4]  11.00-12.00  sec  2.74 GBytes  23.5 Gbits/sec    0   1.46 MBytes
[  4]  12.00-13.00  sec  2.74 GBytes  23.5 Gbits/sec    0   1.46 MBytes
[  4]  13.00-14.00  sec  2.73 GBytes  23.5 Gbits/sec    0   1.57 MBytes
[  4]  14.00-15.00  sec  2.72 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  15.00-16.00  sec  2.73 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  16.00-17.00  sec  2.73 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  17.00-18.00  sec  2.73 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  18.00-19.00  sec  2.72 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  19.00-20.00  sec  2.73 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  20.00-21.00  sec  2.73 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  21.00-22.00  sec  2.72 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  22.00-23.00  sec  2.72 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  23.00-24.00  sec  2.72 GBytes  23.4 Gbits/sec    1   1.76 MBytes
[  4]  24.00-25.00  sec  2.74 GBytes  23.5 Gbits/sec    0   1.76 MBytes
[  4]  25.00-26.00  sec  2.74 GBytes  23.5 Gbits/sec    0   1.76 MBytes
[  4]  26.00-27.00  sec  2.74 GBytes  23.5 Gbits/sec    0   1.76 MBytes
[  4]  27.00-28.00  sec  2.73 GBytes  23.4 Gbits/sec    0   1.76 MBytes
[  4]  28.00-29.00  sec  2.65 GBytes  22.8 Gbits/sec    0   1.76 MBytes
[  4]  29.00-30.00  sec  2.73 GBytes  23.5 Gbits/sec    0   1.76 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-30.00  sec  81.3 GBytes  23.3 Gbits/sec  434             sender
[  4]   0.00-30.00  sec  81.3 GBytes  23.3 Gbits/sec                  receiver

iperf Done.

top - 17:10:44 up 21:55,  2 users,  load average: 0.21, 0.07, 0.02
Tasks: 216 total,   1 running, 215 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.7 us, 30.8 sy,  0.0 ni, 67.9 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu5  :  0.0 us,  4.0 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu6  :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.4 us, 41.6 sy,  0.0 ni, 46.9 id,  0.0 wa,  0.0 hi, 11.1 si,  0.0 st
KiB Mem : 74224280 total, 73448200 free,   498208 used,   277872 buff/cache
KiB Swap: 75486208 total, 75486208 free,        0 used. 73183560 avail Mem

Note: iperf3 reports bandwidth as TCP payload (goodput), so 23.5 Gbit/s is wire-speed 25G Ethernet.
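A quick back-of-the-envelope check (my arithmetic, assuming the 1448-byte MSS with TCP timestamps seen in the ss output later): a full-size segment puts 1448 payload bytes into 1538 bytes on the wire (32 TCP + 20 IP + 14 Ethernet + 4 FCS + 8 preamble + 12 inter-frame gap), so wire-speed 25G carries at most about 1448/1538 x 25 of payload:

# Payload fraction of a full-size segment times the 25G line rate
echo 'scale=2; 1448 / 1538 * 25' | bc    # ~23.5 Gbit/s of TCP payload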

Page 5: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs – tcpdump


cumulus@edge01:~/pcaps$ sudo tcpdump -i enp4s0f1 -w single.pcap tcp port 5201
tcpdump: listening on enp4s0f1, link-type EN10MB (Ethernet), capture size 262144 bytes
1098 packets captured
1098 packets received by filter
0 packets dropped by kernel

cumulus@server01:~$ iperf3 -c rack-edge01 -p 5201 -t 2 -b 50M
Connecting to host rack-edge01, port 5201
[  4] local 10.0.1.1 port 34948 connected to 10.0.3.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  5.46 MBytes  45.8 Mbits/sec   21    109 KBytes
[  4]   1.00-2.00   sec  5.88 MBytes  49.3 Mbits/sec   29   70.7 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-2.00   sec  11.3 MBytes  47.5 Mbits/sec   50             sender
[  4]   0.00-2.00   sec  11.3 MBytes  47.5 Mbits/sec                  receiver
iperf Done.

cumulus@edge01:~/pcaps$ tcpdump -r single.pcap
reading from file single.pcap, link-type EN10MB (Ethernet)
07:52:57.600873 IP rack-server01.34946 > rack-edge01.5201: Flags [SEW], seq 1655732583, win 29200, options [mss 1460,sackOK,TS val 33182573 ecr 0,nop,wscale 7], length 0
07:52:57.600900 IP rack-edge01.5201 > rack-server01.34946: Flags [S.E], seq 319971738, ack 1655732584, win 28960, options [mss 1460,sackOK,TS val 56252912 ecr 33182573,nop,wscale 7], length 0
07:52:57.601133 IP rack-server01.34946 > rack-edge01.5201: Flags [.], ack 1, win 229, options [nop,nop,TS val 33182573 ecr 56252912], length 0
07:52:57.601160 IP rack-server01.34946 > rack-edge01.5201: Flags [P.], seq 1:38, ack 1, win 229, options [nop,nop,TS val 33182573 ecr 56252912], length 37
07:52:57.601169 IP rack-edge01.5201 > rack-server01.34946: Flags [.], ack 38, win 227, options [nop,nop,TS val 56252912 ecr 33182573], length 0
07:52:57.601213 IP rack-edge01.5201 > rack-server01.34946: Flags [P.], seq 1:2, ack 38, win 227, options [nop,nop,TS val 56252912 ecr 33182573], length 1
07:52:57.601412 IP rack-server01.34946 > rack-edge01.5201: Flags [.], ack 2, win 229, options [nop,nop,TS val 33182573 ecr 56252912], length 0
07:52:57.601419 IP rack-server01.34946 > rack-edge01.5201: Flags [P.], seq 38:42, ack 2, win 229, options [nop,nop,TS val 33182573 ecr 56252912], length 4
07:52:57.640098 IP rack-edge01.5201 > rack-server01.34946: Flags [.], ack 42, win 227, options [nop,nop,TS val 56252922 ecr 33182573], length 0
...

Make sure your data sources and pcap filters don't allow drops! (tcpdump's "packets dropped by kernel" counter above should read 0.)
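One way to help the capture keep up (these are standard tcpdump options, but the sizes here are illustrative, not from the talk) is to enlarge the kernel capture buffer and trim the per-packet snap length so only headers are copied:

# -B: capture buffer in KiB, -s: bytes kept per packet, -n: skip DNS lookups
sudo tcpdump -i enp4s0f1 -n -B 65536 -s 128 -w single.pcap tcp port 5201

Then verify "packets dropped by kernel" still reads 0 when the capture ends.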

Page 6: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs - wireshark

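The slide showed Wireshark's graphical view of the capture. For a quick look at the same pcap without the GUI, Wireshark's command-line sibling tshark can produce similar summaries (a sketch, assuming tshark is installed):

tshark -r single.pcap -q -z io,stat,1    # per-second packet/byte counts
tshark -r single.pcap -q -z conv,tcp     # per-TCP-conversation totals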

Page 7: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs – tcpprobe


Column  Contents
------  -----------------------
1       Kernel Timestamp
2       Source_IP:port
3       Destination_IP:port
4       Packet Length
5       Send Next
6       Send Unacknowledged
7       Send Congestion Window
8       Slow Start Threshold
9       Send Window
10      Smoothed RTT
11      Receive Window

cumulus@server01:~$ sudo modprobe tcp_probe port=5201 full=1
cumulus@server01:~$ sudo chmod oug+r /proc/net/tcpprobe
cumulus@server01:~$ cat /proc/net/tcpprobe > /tmp/tcpprobe.out &
[1] 6921
cumulus@server01:~$ iperf3 -c edge01-hs -t 5
...snip...
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-5.00   sec  13.0 GBytes  22.2 Gbits/sec  538             sender
[  4]   0.00-5.00   sec  12.9 GBytes  22.2 Gbits/sec                  receiver
iperf Done.
cumulus@server01:~$ kill 6921
cumulus@server01:~$ head 10 /tmp/tcpprobe.out
==> /tmp/tcpprobe.out <==
4.111198452 10.0.0.2:45520 10.0.0.5:5201 32 0x358a629a 0x3589f17a 20 2147483647 57984 142 29312
4.111461826 10.0.0.2:45520 10.0.0.5:5201 32 0x358ad962 0x358a629a 21 20 115840 161 29312
4.111731474 10.0.0.2:45520 10.0.0.5:5201 32 0x358b55d2 0x358ad962 22 20 171648 173 29312
4.112000993 10.0.0.2:45520 10.0.0.5:5201 44 0x358bd7ea 0x358b55d2 23 20 170880 185 29312
4.112037126 10.0.0.2:45520 10.0.0.5:5201 32 0x358c107a 0x358b55d2 16 16 225920 195 29312
4.112260554 10.0.0.2:45520 10.0.0.5:5201 44 0x358c5faa 0x358c1622 17 16 275200 188 29312
4.112278958 10.0.0.2:45520 10.0.0.5:5201 32 0x358c983a 0x358c1622 23 20 275200 188 29312
4.112533754 10.0.0.2:45520 10.0.0.5:5201 32 0x358ced12 0x358c326a 16 16 338944 202 29312
4.112842106 10.0.0.2:45520 10.0.0.5:5201 44 0x358d63da 0x358d03b2 17 16 396800 202 29312
4.112854569 10.0.0.2:45520 10.0.0.5:5201 32 0x358d63da 0x358d03b2 23 20 396800 202 29312

Note that the smoothed RTT is ~200 usec even with no competing traffic!
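Using the column map above, pulling time versus congestion window out of the log for plotting is a one-liner (column 7 is the send congestion window):

# Extract timestamp and cwnd for gnuplot or similar
awk '{ print $1, $7 }' /tmp/tcpprobe.out > /tmp/cwnd.dat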

Page 8: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs – TCP congestion algorithms and socket stats


cumulus@server01:~$ ls /lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp*
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_bic.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_cdg.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_dctcp.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_diag.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_highspeed.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_htcp.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_hybla.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_illinois.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_lp.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_probe.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_scalable.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_vegas.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_veno.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_westwood.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_yeah.ko
cumulus@server01:~$ cat /proc/sys/net/ipv4/tcp_congestion_control
cubic
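Any of the modules above can be loaded and selected at runtime; a sketch using DCTCP from the listing (the sysctl sets the system-wide default, while iperf3's -C flag on Linux builds picks an algorithm for just that test):

sudo modprobe tcp_dctcp
sudo sysctl -w net.ipv4.tcp_congestion_control=dctcp   # system-wide default
iperf3 -c rack-edge01 -p 5201 -t 30 -C dctcp           # or just for this test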

cumulus@server01:~$ ss --tcp --info dport = 5201
State   Recv-Q   Send-Q    Local Address:Port   Peer Address:Port
ESTAB   0        2480400   10.0.0.2:45524       10.0.0.5:5201
     cubic wscale:7,7 rto:204 rtt:0.137/0.008 mss:1448 cwnd:450 ssthresh:336
     bytes_acked:25460316350 segs_out:17583731 segs_in:422330 send 38049.6Mbps
     lastrcv:122325132 unacked:272 retrans:0/250 reordering:86 rcv_space:29200

cubic: the Linux default congestion-control algorithm since kernel 2.6.19

param        value
-----------  -----------
wscale       7,7
rto          204
rtt          0.137/0.008
mss          1448
cwnd         450
ssthresh     336
bytes_acked  25460316350
segs_out     17583731
segs_in      422330
send         38049.6Mbps
lastrcv      122325132
unacked      272
retrans      0/250
reordering   86
rcv_space    29200
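Since ss reads live kernel state, it is handy to poll it while a test runs (same filter as the command above, refreshed every second):

watch -n 1 'ss --tcp --info dport = 5201'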

Page 9: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs – NIC Tuning


cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 25000
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.rmem_max
net.core.rmem_max = 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.wmem_max
net.core.wmem_max = 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.rmem_default
net.core.rmem_default = 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.wmem_default
net.core.wmem_default = 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 87380 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 65536 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_low_latency
net.ipv4.tcp_low_latency = 1
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_adv_win_scale
net.ipv4.tcp_adv_win_scale = 1

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
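These settings revert at reboot when applied with sysctl -w; a minimal sketch of making them persistent (assumes a distribution that reads /etc/sysctl.d/; the file name is illustrative and the values mirror those shown above):

cat <<'EOF' | sudo tee /etc/sysctl.d/90-buffer-testing.conf
net.core.netdev_max_backlog = 25000
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304
EOF
sudo sysctl --system    # reload everything under /etc/sysctl.d/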

Page 10: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs – TCP Tuning


cumulus@edge01:/proc/sys/net/ipv4$ ls tcp_*
tcp_abort_on_overflow             tcp_keepalive_probes    tcp_reordering
tcp_adv_win_scale                 tcp_keepalive_time      tcp_retrans_collapse
tcp_allowed_congestion_control    tcp_limit_output_bytes  tcp_retries1
tcp_app_win                       tcp_low_latency         tcp_retries2
tcp_autocorking                   tcp_max_orphans         tcp_rfc1337
tcp_available_congestion_control  tcp_max_reordering      tcp_rmem
tcp_base_mss                      tcp_max_syn_backlog     tcp_sack
tcp_challenge_ack_limit           tcp_max_tw_buckets      tcp_slow_start_after_idle
tcp_congestion_control            tcp_mem                 tcp_stdurg
tcp_dsack                         tcp_min_rtt_wlen        tcp_synack_retries
tcp_early_retrans                 tcp_min_tso_segs        tcp_syncookies
tcp_ecn                           tcp_moderate_rcvbuf     tcp_syn_retries
tcp_ecn_fallback                  tcp_mtu_probing         tcp_thin_dupack
tcp_fack                          tcp_no_metrics_save     tcp_thin_linear_timeouts
tcp_fastopen                      tcp_notsent_lowat       tcp_timestamps
tcp_fastopen_key                  tcp_orphan_retries      tcp_tso_win_divisor
tcp_fin_timeout                   tcp_pacing_ca_ratio     tcp_tw_recycle
tcp_frto                          tcp_pacing_ss_ratio     tcp_tw_reuse
tcp_fwmark_accept                 tcp_probe_interval      tcp_window_scaling
tcp_invalid_ratelimit             tcp_probe_threshold     tcp_wmem
tcp_keepalive_intvl               tcp_recovery            tcp_workaround_signed_windows

tcp_ecn - INTEGER
    Control use of Explicit Congestion Notification (ECN) by TCP.
    ECN is used only when both ends of the TCP connection indicate
    support for it. This feature is useful in avoiding losses due
    to congestion by allowing supporting routers to signal
    congestion before having to drop packets.
    Possible values are:
        0 Disable ECN. Neither initiate nor accept ECN.
        1 Enable ECN when requested by incoming connections and
          also request ECN on outgoing connection attempts.
        2 Enable ECN when requested by incoming connections
          but do not request ECN on outgoing connections.
    Default: 2

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
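Per the documentation above, moving a host from the passive default (2) to full ECN negotiation is a single sysctl; the [SEW] flags on the SYN in the earlier tcpdump output show what an ECN-setup handshake looks like on the wire:

sudo sysctl -w net.ipv4.tcp_ecn=1    # also request ECN on outgoing connections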

Page 11: How deep is your buffer – Demystifying buffers and application performance


Live Action Time!

Page 12: How deep is your buffer – Demystifying buffers and application performance


Tools and Knobs – What’s next for me

- Find/write a good "mice" traffic generator
  - modify iperf3 to include mean-time-to-completion with blocks

- DCTCP with both ECN and Priority Flow Control
  - high-performance fabrics combine end-to-end congestion management and lossless links (InfiniBand, Fibre Channel, PCIe, NUMAlink, etc.)


Page 13: How deep is your buffer – Demystifying buffers and application performance


How Much Buffer – the take away

- If the last bit of performance matters to you, do the testing; be careful of what you read.

- If not, take solace... the web-scales use "small buffer" switches.


Page 14: How deep is your buffer – Demystifying buffers and application performance


Thank you! Visit us at cumulusnetworks.com or follow us @cumulusnetworks

© 2017 Cumulus Networks. Cumulus Networks, the Cumulus Networks Logo, and Cumulus Linux are trademarks or registered trademarks of Cumulus Networks, Inc. or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. The registered trademark Linux® is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.