Transcript of DevConf 2017 - Meeting NFV Networking Requirements
Meeting Networking Requirements for NFV
Flavio Bruno Leitner, Principal Software Engineer - Networking Service Team, January 2017
● NFV concepts and goals
● NFV requirements
● 10G Ethernet
● Physical-Virtual-Physical (PVP) scenario
● Some network solutions
● Dive into DPDK-enabled Open vSwitch
● Possible improvements
Agenda
2
Virtualize network hardware appliances
NFV - Network Functions Virtualization
3
[Diagram: hardware appliances (Firewall, LB, Router) virtualized into VMs on a common virtualization layer]
A new product/project needs new networking infrastructure
NFV - Goals
4
Before:
● Slow Process
● High Cost
● Less Flexibility

After:
● Fast Process
● Lower Cost
● Greater Flexibility
Deploy a new service with a click!
NFV - Networking Requirements
5
VM + Virtualization =
Low Latency
High Throughput
… with zero packet loss
NFV Requirements - Challenge
6
Worst case: wire speed with the smallest frame
Ethernet frame: 64 bytes [MAC header (14) + payload (46) + FCS (4)]
Ethernet overhead: 20 bytes [inter-frame gap (12) + MAC preamble (8)]
Smallest frame on the wire: 64 + 20 = 84 bytes
Packet rate: 10 Gbit/s ÷ (84 × 8 bits) = 14.88 Mpps (million packets per second)
Challenge 10GBit/s
7
How much time per packet?
1 / 14.88 Mpps = 67.2 nanoseconds
3 GHz CPU => ~200 cycles
Cache miss => ~32 nanoseconds
L2 cache hit => ~10 cycles
L3 cache hit => ~36 cycles
Small Budget!
Challenge 10GBit/s - 14.88Mpps
Sources:
http://www.intel.co.uk/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
https://people.netfilter.org/hawk/presentations/nfws2014/dp-accel-10G-challenge.pdf
8
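A quick back-of-the-envelope check of the figures above (a sketch that only uses the 10 Gbit/s line rate, the 84-byte wire frame and the 3 GHz clock quoted on the previous slides):

#include <stdio.h>

int main(void)
{
    /* Smallest frame on the wire: 64-byte frame + 20 bytes of
       inter-frame gap and preamble = 84 bytes = 672 bits. */
    const double line_rate_bps = 10e9;      /* 10 Gbit/s */
    const double wire_bits     = 84 * 8;    /* 672 bits per frame */
    const double cpu_hz        = 3e9;       /* 3 GHz CPU */

    double pps        = line_rate_bps / wire_bits;  /* ~14.88 Mpps */
    double ns_per_pkt = 1e9 / pps;                  /* ~67.2 ns */
    double cycles     = cpu_hz / pps;               /* ~200 cycles */

    printf("packet rate : %.2f Mpps\n", pps / 1e6);
    printf("time budget : %.1f ns per packet\n", ns_per_pkt);
    printf("cycle budget: %.0f cycles per packet\n", cycles);
    return 0;
}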
Networking to Virtual Machines - PVP
9
[Diagram: PVP path - Traffic Generator <-> Physical Ports <-> vSwitch <-> Logical Ports <-> VM]
● Linux Bridge
● Open vSwitch (OVS)
● SR-IOV
● DPDK Enabled Open vSwitch (OVS-DPDK)
Networking to Virtual Machines
10
● Use the kernel datapath
● NAPI
● Unpredictable latency
● Not SDN ready
● Low throughput: ~1Mpps/core (Phy-to-Phy)
● qemu runs in userspace
Linux Bridge
11
● Use the kernel datapath
● NAPI
● Unpredictable latency
● SDN ready
● Low throughput: ~1Mpps/core
● qemu runs in userspace
Open vSwitch
12
● Low latency
● High throughput
● Bypass the host
● Not SDN friendly - Can’t use a virtual switch in the host
● Physical HW exposed - no abstraction, certification issues/costs
● Migration issues
● Limited number of devices
SR-IOV
13
What is DPDK?
● A set of libraries and drivers for fast packet processing.
● Open Source, BSD License
Usage:
● Receive and send packets within the minimum number of CPU cycles.
What it is not:
● A networking stack
Data Plane Development Kit (DPDK)
14
Consists of APIs, provided through the BSD driver running in userspace, to
configure the devices and their respective queues. In addition, a PMD
accesses the RX and TX descriptors directly without any interrupts to quickly
receive, process and deliver packets in the user’s application.
DPDK - Poll-Mode Drivers
Source: http://dpdk.org/doc/guides/prog_guide/poll_mode_drv.html
15
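A minimal sketch of the poll-mode pattern described above, using the DPDK burst RX/TX API (port and queue numbers are illustrative; EAL initialization, port setup and error handling are omitted, and the exact port-id types vary between DPDK releases):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Busy-poll one RX port/queue and forward every packet to a TX port. */
static void pmd_loop(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf *pkts[BURST_SIZE];

    for (;;) {
        /* Poll the RX descriptors directly - no interrupts involved. */
        uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, pkts, BURST_SIZE);

        if (nb_rx == 0)
            continue;

        /* Run-to-completion: the same batch is processed and sent. */
        uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, pkts, nb_rx);

        /* Free whatever the TX queue could not accept. */
        while (nb_tx < nb_rx)
            rte_pktmbuf_free(pkts[nb_tx++]);
    }
}

This loop is what a PMD thread spends its cycles on, which is why a core running it always shows 100% utilization.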
● Open vSwitch kernel module is just a cache managed by userspace.
● DPDK provides the libraries and drivers to RX/TX from userspace.
● Yeah, DPDK enabled Open vSwitch!
● Remember the 14.88Mpps? ~16Mpps/core Phys-to-Phys.
● Costs at least one core kept 100% busy running the PMD thread
(power consumption, cooling, wasted cycles)
Open vSwitch + DPDK
16
● Provide network connectivity to Virtual Machines
● Qemu runs in userspace
● Vhost-user interface (TX/RX shared virtqueues)
● Guests can choose between kernel or userspace
● Throughput: ~3.5Mpps/core (default features, PVP, tuned)
● Scales up linearly with multiple parallel streams
● System needs to be carefully tuned
OVS-DPDK for NFV
17
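As a rough idea of how such an OVS-DPDK + vhost-user setup is wired up with OVS 2.6-era commands (bridge and interface names and the PMD CPU mask are illustrative; exact options vary between releases):

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true        # initialize DPDK in ovs-vswitchd
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4      # pin PMD threads to dedicated cores
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev     # userspace datapath
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk   # physical DPDK port
ovs-vsctl add-port br0 vhost0 -- set Interface vhost0 type=dpdkvhostuser   # vhost-user port for the VM

qemu is then pointed at the vhost-user socket OVS creates for that port, which gives the guest its virtio queues.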
● Poll-Mode Driver thread owns a CPU
● Devices (queues) are distributed between PMD threads
● Each PMD thread will busy loop polling and processing
● Run-To-Completion
● Batching (reduce per packet processing cost)
How does it work?
18
X-Ray Patient: OVS-DPDK PMD Thread
19
[Diagram: one PMD thread polling Port 1 … Port n, feeding the forwarding plane, with a DROP path]
PMD in PVP
20
[Diagram: PMD in PVP - the PMD thread and forwarding plane sit between physical ports P1/P2 and logical ports L1/L2; Traffic Generator <-> PhysPorts <-> vSwitch <-> LogicPorts <-> VM]
Packet Flow
21
[Diagram: PMD and forwarding plane connecting PhysicalNIC ports (10, 11) and vhost-user ports (20, 21)]
Flows:
in_port=10,action=21
in_port=20,action=11
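Those two flows could be installed with ovs-ofctl along these lines (the bridge name br0 and the OpenFlow port numbers are illustrative):

ovs-ofctl add-flow br0 in_port=10,actions=output:21
ovs-ofctl add-flow br0 in_port=20,actions=output:11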
Measuring Throughput: Zero Packet Loss
22
Expected:
● Constant traffic rate
● System is constantly dropping packets
● Decrease traffic rate, repeat
Packet Drops: Aim For Weak Spots
23
[Diagram repeated from the previous slide: PMD and forwarding plane connecting PhysicalNIC ports (10, 11) and vhost-user ports (20, 21)]
Flows:
in_port=10,action=21
in_port=20,action=11
Packet Drops: NIC RX QUEUE
24
[Diagram: PhysicalNIC RX queue feeding the PMD and forwarding plane]
● Fixed size, limited by hardware
● Drops are reported in the port stats
● Queue overflow (producer-consumer problem)
Packet Drops: Vhost-user TX Queue
25
[Diagram: PMD and forwarding plane with the guest's vhost-user queue, showing a DROP path]
● Fixed size, limited in software
● Drops are reported in the guest
● Queue overflow (producer-consumer problem)
Packet Drops: Vhost-user RX Queue
26
[Diagram: PMD and forwarding plane with the guest's vhost-user RX queue]
● Fixed size, limited in software
● Drops are reported in the port stats
● Queue overflow (producer-consumer problem)
Measuring Throughput: Zero Packet Loss
27
Expected:
● Constant traffic rate
● System is constantly dropping packets
● Decrease traffic rate, repeat

Reality:
● System is stable for a period of time
● Few packets dropped sporadically
● Decrease traffic rate, repeat
● Very low throughput
● Understand what is causing the drops
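A sketch of the step-down search this procedure implies (the loss check is a stand-in: it simply pretends drops start above 3.5 Mpps, the PVP figure quoted earlier, so the example is self-contained):

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the real measurement: offer traffic at `mpps` for a
   fixed interval and report whether anything was dropped. */
static bool drops_at_rate(double mpps)
{
    return mpps > 3.5;   /* illustrative threshold only */
}

int main(void)
{
    double rate = 14.88;        /* start at 10G line rate, in Mpps */
    const double step = 0.1;

    /* Constant rate, check for loss, decrease, repeat. */
    while (rate > 0.0 && drops_at_rate(rate))
        rate -= step;

    printf("zero-loss throughput: %.2f Mpps\n", rate);
    return 0;
}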
Estimating PMD Processing Budget
28
Throughput (Mpps)   Proc. Budget (µs)   PMD Budget (µs)
3.0                 0.33                0.16
4.0                 0.25                0.12
5.0                 0.20                0.10
6.0                 0.16                0.08

(Proc. budget = 1 / throughput; the PMD budget is roughly half of that, since in the PVP path each packet passes through the single PMD thread twice.)
Measuring Polling/Processing cost.
29
Device       Mode                   Time (µs)
Phys         Ingress Polling        0.2
Phys         Ingress Processing     3.1
Phys         Egress Polling         0.016
Phys         Egress Processing      0
vhost-user   Ingress Polling        0.013
vhost-user   Ingress Processing     0
vhost-user   Egress Polling         0.73
vhost-user   Egress Processing      2.14
Total        Polling + Processing   6.2
● The total of 6.2µs is ~24x the per-packet budget (0.25µs)
● Assuming 32 packets per batch, the per-packet cost drops to 0.19µs, i.e. ~5Mpps
● 3.5Mpps with zero packet loss (0.29µs budget) => an average batch size of 21.4
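The same batching arithmetic, spelled out (the 6.2µs, 32-packet and 0.29µs figures are the ones quoted above):

#include <stdio.h>

int main(void)
{
    const double batch_cost_us = 6.2;   /* polling + processing per batch */

    /* A full 32-packet batch amortizes the cost to ~0.19 us per packet,
       which corresponds to roughly 5 Mpps. */
    printf("32-packet batch: %.2f us/pkt, %.1f Mpps\n",
           batch_cost_us / 32, 32 / batch_cost_us);

    /* At 3.5 Mpps zero-loss the per-packet budget is ~0.29 us, which
       implies an average batch of about 21 packets. */
    printf("0.29 us budget : avg batch of %.1f packets\n",
           batch_cost_us / 0.29);
    return 0;
}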
Batching
30
● Internal sources
● External sources
What is wasting time?
31
● What are they?
● How significant are they?
External Sources
32
● PMD Processing Budget (3Mpps): 0.16µs
● Ftrace tool => Kernel RCU callback: 50µs + preemption cost
● Roughly 8 batches
● rcu_nocbs=<cpu-list>, rcu_nocb_poll
External Interferences: RCU Callback
33
● nohz_full
● No way to get rid of it
External Interferences: Timer Interrupt
34
● Scheduling issues:
○ irqbalance off
○ isolcpus
● Watchdog: nowatchdog
● Power Management: processor.max_cstate=1
● Hyper Threading
● Real-Time Kernel
External Interferences: Other Sources
35
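Pulling the boot-time knobs from the last few slides together, the host kernel command line might include something like the following (the CPU list 2-9 is purely illustrative and has to match the cores reserved for PMD threads and vCPUs; irqbalance is disabled separately as a service):

isolcpus=2-9 nohz_full=2-9 rcu_nocbs=2-9 rcu_nocb_poll nowatchdog processor.max_cstate=1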
● Use DPDK L-Thread subsystem to isolate devices
● Disable mergeable buffers to increase batch sizes inside the guest
● Disable mergeable buffers to decrease per packet cost
● Increase OVS-DPDK batch size
● Increase NIC queue size
● Increase virtio ring size
● BIOS settings
● Hardware Offloading
● Faster platform/CPUs
● Improve CPU isolation in the kernel
Possible Improvements
36
Thank You
Questions & Answers
Source: http://dpdk.org/doc/guides/prog_guide/poll_mode_drv.html
37