VMworld 2016: vSphere 6.x Host Resource Deep Dive


vSphere 6.x Host Resource Deep Dive
Frank Denneman
Niels Hagoort

INF8430

#INF8430

Agenda
• Compute
• Storage
• Network
• Q&A

Introduction

www.cloudfix.nl

Niels Hagoort
• Independent Architect
• VMware VCDX #212
• VMware vExpert (NSX)

Frank Denneman
• Enjoying Summer 2016
• VMware VCDX #29
• VMware vExpert

www.frankdenneman.nl

Compute (NUMA, NUMA, NUMA)

Insights In Virtual Data Centers

Modern dual-socket CPU servers are Non-Uniform Memory Access (NUMA) systems
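A quick way to confirm how many NUMA nodes a host exposes is the ESXi shell (a hedged example; output fields can vary slightly between builds). The output includes Physical Memory and a NUMA Node Count; a dual-socket host normally reports two nodes, more if a snoop mode such as Cluster-on-Die splits each socket:

[root@ESXi01:~] esxcli hardware memory get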

Local and Remote Memory

NUMA Focus Points

• Caching Snoop modes

• DIMM configuration

• Size VMs to match the CPU topology

CPU Cache (the forgotten hero)

CPU Architecture

Caching Snoop Modes

DIMM Configuration (and why 384 GB is not an optimal configuration)

Memory Constructs

3-DPC – 384 GB – 2400 MHz DIMMs

DIMMs Per Channel

2-DPC – 384 GB – 2400 MHz DIMMs

Current Sweet Spot: 512 GB
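The arithmetic behind that sweet spot (assuming a typical dual-socket host with four channels per CPU and three DIMM slots per channel): 384 GB built from 16 GB DIMMs requires 24 DIMMs, i.e. 3 DPC, which forces the 2400 MHz DIMMs to clock down, while 512 GB built from 32 GB DIMMs needs only 16 DIMMs, i.e. 2 DPC, keeping the full rated speed.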

Right-Size Your VM (alignment equals consistent performance)

ESXi NUMA focus points

• CPU scheduler allocates core or HT cycles

• NUMA scheduler handles initial placement (IP) + load balancing (LB)

• vCPU configuration impacts IP & LB

Scheduling constructs

12 vCPUs on a 20-Core System

Align To CPU Topology

• Resize vCPU configuration to match core count

• Use numa.vcpu.preferHT

• Use cores per socket (CORRECTLY)

• Attend INF8089 at 5 PM in this room

Prefer HT + 12 Cores Per Socket
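In .vmx terms, that example boils down to two advanced settings (a minimal sketch; the value 12 assumes a host with 12 cores per socket, so adjust it to your own topology):

numa.vcpu.preferHT = "TRUE"
cpuid.coresPerSocket = "12"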

Storage (how far away is your data?)

The Importance of Access Latency

Location of operands   CPU cycles   Perspective
CPU register           1            Brain (nanosecond)
L1/L3 cache            10           End of this room
Local memory           100          Entrance of building
Disk                   10^6         New York

Every Layer = CPU Cycles & Latency
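To put those cycle counts into time (assuming a 2.5 GHz core, where one cycle takes 0.4 ns): local memory at 100 cycles costs about 40 ns, while a disk access at 10^6 cycles costs roughly 0.4 ms, a million-fold gap the CPU spends waiting.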

Industry Moves Toward NVMe

• SSD bandwidth capabilities exceed current controller bandwidth

• Protocol inefficiencies are a dominant contributor to access time

• NVMe is architected from the ground up for non-volatile memory

I/O Queue Per CPU

Driver Stack

Not All Drivers Are Created Equal
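As a hedged example of checking this on your own host, esxcli lists each storage adapter together with the driver it loaded (driver name appears in the Driver column):

[root@ESXi01:~] esxcli storage core adapter list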

Network

pNIC considerations for VXLAN performance

• Additional layer of packet processing

• Consumes CPU cycles on each packet for encapsulation/decapsulation

• Some of the offload capabilities of the NIC cannot be used (TCP-based)

• VXLAN offloading! (TSO / CSO)

VXLAN


[root@ESXi02:~] vmkload_mod -s bnx2x
vmkload_mod module information
 input file: /usr/lib/vmware/vmkmod/bnx2x
 Version: Version 1.78.80.v60.12, Build: 2494585, Interface: 9.2 Built on: Feb 5 2015
 Build Type: release
 License: GPL
 Name-space: com.broadcom.bnx2x#9.2.3.0
 Required name-spaces:
  com.broadcom.cnic_register#9.2.3.0
  com.vmware.driverAPI#9.2.3.0
  com.vmware.vmkapi#v2_3_0_0
 Parameters:
  skb_mpool_max: int
   Maximum attainable private socket buffer memory pool size for the driver.
  skb_mpool_initial: int
   Driver's minimum private socket buffer memory pool size.
  heap_max: int
   Maximum attainable heap size for the driver.
  heap_initial: int
   Initial heap size allocated for the driver.
  disable_feat_preemptible: int
   For debug purposes, disable FEAT_PREEMPTIBLE when set to value of 1
  disable_rss_dyn: int
   For debug purposes, disable RSS_DYN feature when set to value of 1
  disable_fw_dmp: int
   For debug purposes, disable firmware dump feature when set to value of 1
  enable_vxlan_ofld: int
   Allow vxlan TSO/CSO offload support. [Default is disabled, 1: enable vxlan offload, 0: disable vxlan offload]
  debug_unhide_nics: int
   Force the exposure of the vmnic interface for debugging purposes [Default is to hide the nics] 1. In SRIOV mode expose the PF
  enable_default_queue_filters: int
   Allow filters on the default queue. [Default is disabled for non-NPAR mode, enabled by default on NPAR mode]
  multi_rx_filters: int
   Define the number of RX filters per NetQueue: (allowed values: -1 to Max # of RX filters per NetQueue, -1: use the default number of RX filters; 0: Disable use of multiple RX filters; 1..Max # the number of RX filters per NetQueue: will force the number of RX filters to use for NetQueue
........

[root@ESXi01:~] esxcli system module parameters list -m bnx2x
Name                          Type  Value  Description
----------------------------  ----  -----  -----------
RSS                           int          Control the number of queues in an RSS pool. Max 4.
autogreeen                    uint         Set autoGrEEEn (0:HW default; 1:force on; 2:force off)
debug                         uint         Default debug msglevel
debug_unhide_nics             int          Force the exposure of the vmnic interface for debugging purposes [Default is to hide the nics] 1. In SRIOV mode expose the PF
disable_feat_preemptible      int          For debug purposes, disable FEAT_PREEMPTIBLE when set to value of 1
disable_fw_dmp                int          For debug purposes, disable firmware dump feature when set to value of 1
disable_iscsi_ooo             uint         Disable iSCSI OOO support
disable_rss_dyn               int          For debug purposes, disable RSS_DYN feature when set to value of 1
disable_tpa                   uint         Disable the TPA (LRO) feature
dropless_fc                   uint         Pause on exhausted host ring
eee                                        set EEE Tx LPI timer with this value; 0: HW default
enable_default_queue_filters  int          Allow filters on the default queue. [Default is disabled for non-NPAR mode, enabled by default on NPAR mode]
enable_vxlan_ofld             int          Allow vxlan TSO/CSO offload support. [Default is disabled, 1: enable vxlan offload, 0: disable vxlan offload]
gre_tunnel_mode               uint         Set GRE tunnel mode: 0 - NO_GRE_TUNNEL; 1 - NVGRE_TUNNEL; 2 - L2GRE_TUNNEL; 3 - IPGRE_TUNNEL
gre_tunnel_rss                uint         Set GRE tunnel RSS mode: 0 - GRE_OUTER_HEADERS_RSS; 1 - GRE_INNER_HEADERS_RSS; 2 - NVGRE_KEY_ENTROPY_RSS
heap_initial                  int          Initial heap size allocated for the driver.
heap_max                      int          Maximum attainable heap size for the driver.
int_mode                      uint         Force interrupt mode other than MSI-X (1 INT#x; 2 MSI)
max_agg_size_param            uint         max aggregation size
mrrs                          int          Force Max Read Req Size (0..3) (for debug)
multi_rx_filters              int          Define the number of RX filters per NetQueue: (allowed values: -1 to Max # of RX filters per NetQueue, -1: use the default number of RX filters; 0: Disable use of multiple RX filters; 1..Max # the number of RX filters per NetQueue: will force the number of RX filters to use for NetQueue
native_eee                    uint
num_queues                    uint         Set number of queues (default is as a number of CPUs)
num_rss_pools                 int          Control the existence of a RSS pool. When 0, RSS pool is disabled. When 1, there will be a RSS pool (given that RSS > 0).
........

Driver Summary

• Check the supported features of your pNIC

• Check the HCL for supported features in the driver module

• Check the driver module: does it require you to enable features?

• Other async (vendor) driver available?
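As an example of those last two checks: esxcli shows which driver and version a pNIC actually runs, and the bnx2x listings above show enable_vxlan_ofld defaults to disabled. A hedged sketch (vmnic0 is a placeholder; module parameters only take effect once the module reloads, typically at the next host reboot):

[root@ESXi01:~] esxcli network nic get -n vmnic0
[root@ESXi01:~] esxcli system module parameters set -m bnx2x -p "enable_vxlan_ofld=1"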

RSS & NetQueue

• NIC support required (RSS / VMDq)

• VMDq is the hardware feature; NetQueue is the feature baked into vSphere

• RSS & NetQueue are similar in basic functionality

• RSS uses hashes based on IP / TCP port / MAC

• NetQueue uses MAC filters

Without RSS for VXLAN (1 thread per pNIC)

RSS enabled (>1 thread per pNIC)

How to enable RSS (Intel)

1. Unload module: esxcfg-module -u ixgbe

2. Enable inbox: vmkload_mod ixgbe RSS="4,4"
   Enable async: vmkload_mod ixgbe RSS="1,1"
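Note that vmkload_mod settings do not survive a reboot; a hedged sketch of a persistent equivalent for the inbox driver (assuming the same RSS value list applies):

[root@ESXi01:~] esxcli system module parameters set -m ixgbe -p "RSS=4,4"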

Receive throughput with VXLAN using 10GbE

Intel examples:

Intel Ethernet products          RSS for VXLAN technology
Intel Ethernet X520/X540 series  Scale RSS on VXLAN outer UDP information
Intel Ethernet X710 series       Scale RSS on VXLAN inner or outer header information

X710 series = better at balancing across queues and CPU threads

“What is the maximum performance of the vSphere (D)vSwitch?”

• By default, one transmit (Tx) thread per VM

• By default, one receive (Netpoll) thread per pNIC

• Transmit (Tx) and receive (Netpoll) threads consume CPU cycles

• Each additional thread provides capacity (1 thread = 1 core)

Network I/O CPU consumption

Netpoll Thread

%SYS is ±100% during the test. pNIC receives. (This is the Netpoll thread.)

NetQueue Scaling

{"name": "vmnic0", "switch": "DvsPortset-0", "id": 33554435, "mac": "38:ea:a7:36:78:8c", "rxmode": 0, "uplink": "true", "txpps": 247, "txmbps": 9.4, "txsize": 4753, "txeps": 0.00, "rxpps": 624291, "rxmbps": 479.9, "rxsize": 96, "rxeps": 0.00,"wdt": [ {"used": 0.00, "ready": 0.00, "wait": 41.12, "runct": 0, "remoteactct": 0, "migct": 0, "overrunct": 0, "afftype": "pcpu", "affval": 39, "name": "242.vmnic0-netpoll-10"}, {"used": 0.00, "ready": 0.00, "wait": 41.12, "runct": 0, "remoteactct": 0, "migct": 0, "overrunct": 0, "afftype": "pcpu", "affval": 39, "name": "243.vmnic0-netpoll-11"}, {"used": 82.56, "ready": 0.49, "wait": 16.95, "runct": 8118, "remoteactct": 1, "migct": 9, "overrunct": 33, "afftype": "pcpu", "affval": 45, "name": "244.vmnic0-netpoll-12"}, {"used": 18.71, "ready": 0.75, "wait": 80.54, "runct": 6494, "remoteactct": 0, "migct": 0, "overrunct": 0, "afftype": "vcpu", "affval": 19302041, "name": "245.vmnic0-netpoll-13"}, {"used": 55.64, "ready": 0.55, "wait": 43.81, "runct": 7491, "remoteactct": 0, "migct": 4, "overrunct": 5, "afftype": "vcpu", "affval": 19299346, "name": "246.vmnic0-netpoll-14"}, {"used": 0.14, "ready": 0.10, "wait": 99.48, "runct": 197, "remoteactct": 6, "migct": 6, "overrunct": 0, "afftype": "vcpu", "affval": 19290577, "name": "247.vmnic0-netpoll-15"}, {"used": 0.00, "ready": 0.00, "wait": 0.00, "runct": 0, "remoteactct": 0, "migct": 0, "overrunct": 0, "afftype": "pcpu", "affval": 45, "name": "1242.vmnic0-0-tx"}, {"used": 0.00, "ready": 0.00, "wait": 0.00, "runct": 0, "remoteactct": 0, "migct": 0, "overrunct": 0, "afftype": "pcpu", "affval": 22, "name": "1243.vmnic0-1-tx"}, {"used": 0.00, "ready": 0.00, "wait": 0.00, "runct": 0, "remoteactct": 0, "migct": 0, "overrunct": 0, "afftype": "pcpu", "affval": 24, "name": "1244.vmnic0-2-tx"}, {"used": 0.00, "ready": 0.00, "wait": 0.00, "runct": 0, "remoteactct": 0, "migct": 0, "overrunct": 0, "afftype": "pcpu", "affval": 39, "name": "1245.vmnic0-3-tx"} ],

3 Netpoll threads are used (3 worldlets).
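Per-world network statistics like the JSON above can be collected on the host with the net-stats utility; a hedged example invocation (flag support varies per ESXi build):

[root@ESXi01:~] net-stats -A -t vW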

Tx Thread

PKTGEN is polling, consuming near 100% CPU

%SYS = ±100%. This is the Tx thread.

• VMXNET3 is required!
• Example for vNIC2:

ethernet2.ctxPerDev = "1"

Additional Tx Thread


%SYS = ±200%. CPU threads in same NUMA node as VM.

{"name": "pktgen_load_test21.eth0", "switch": "DvsPortset-0", "id": 33554619, "mac": "00:50:56:87:10:52", "rxmode": 0, "uplink": "false", "txpps": 689401, "txmbps": 529.5, "txsize": 96, "txeps": 0.00, "rxpps": 609159, "rxmbps": 467.8, "rxsize": 96, "rxeps": 54.09, "wdt": [ {"used": 99.81, "ready": 0.19, "wait": 0.00, "runct": 1176, "remoteactct": 0, "migct": 12, "overrunct": 1176, "afftype": "vcpu", "affval": 15691696, "name": "323.NetWdt-Async-15691696"}, {"used": 99.85, "ready": 0.15, "wait": 0.00, "runct": 2652, "remoteactct": 0, "migct": 12, "overrunct": 12, "afftype": "vcpu", "affval": 15691696, "name": "324.NetWorldlet-Async-33554619"} ],

2 worldlets

• Transmit (Tx) and receive (Netpoll) threads can be scaled!

• Take the extra CPU cycles for network I/O into account!

Summary

Q&A

Keep an eye out for our upcoming book!

@frankdenneman @NHagoort
