Transcript of DevConf 2017 - Meeting NFV Networking Requirements
Meeting Networking Requirements for NFV
Flavio Bruno Leitner, Principal Software Engineer - Networking Service Team, January 2017
● NFV concepts and goals
● NFV requirements
● 10G Ethernet
● Physical-Virtual-Physical (PVP) scenario
● Some network solutions
● Dive into DPDK-enabled Open vSwitch
● Possible improvements
Agenda
2
Virtualize network hardware appliances
NFV - Network Functions Virtualization
3
[Diagram: hardware appliances (Firewall, LB, Router) virtualized into VMs on a common virtualization layer]
A new product/project needs new networking infrastructure
NFV - Goals
4
Before:
● Slow Process
● High Cost
● Less Flexibility

After:
● Fast Process
● Lower Cost
● Greater Flexibility
Deploy a new service with a click!
NFV - Networking Requirements
5
VM + Virtualization =
Low Latency
High Throughput
… with zero packet loss
NFV Requirements - Challenge
6
Worst case: wire speed with the smallest frame
Ethernet frame: 64 bytes [MAC header (14) + payload (46) + FCS (4)]
Ethernet overhead: 20 bytes [inter-frame gap (12) + MAC preamble (8)]
Smallest frame on the wire: 64 + 20 = 84 bytes
Packet rate: 10 Gbit/s ÷ (84 × 8 bits) = 14.88 Mpps (million packets per second)
Challenge 10GBit/s
7
How much time per packet?
1 / 14.88 Mpps = 67.2 nanoseconds
3 GHz CPU => ~200 cycles
Cache miss => ~32 nanoseconds
L2 cache hit => ~10 cycles
L3 cache hit => ~36 cycles
Small Budget!
Challenge 10GBit/s - 14.88Mpps
Sources:
http://www.intel.co.uk/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
https://people.netfilter.org/hawk/presentations/nfws2014/dp-accel-10G-challenge.pdf
8
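A quick back-of-the-envelope check of the figures above (a sketch that only uses the 10 Gbit/s line rate, the 84-byte wire frame and the 3 GHz clock quoted on the previous slides):

#include <stdio.h>

int main(void)
{
    /* Smallest frame on the wire: 64-byte frame + 20 bytes of
       inter-frame gap and preamble = 84 bytes = 672 bits. */
    const double line_rate_bps = 10e9;      /* 10 Gbit/s */
    const double wire_bits     = 84 * 8;    /* 672 bits per frame */
    const double cpu_hz        = 3e9;       /* 3 GHz CPU */

    double pps        = line_rate_bps / wire_bits;  /* ~14.88 Mpps */
    double ns_per_pkt = 1e9 / pps;                  /* ~67.2 ns */
    double cycles     = cpu_hz / pps;               /* ~200 cycles */

    printf("packet rate : %.2f Mpps\n", pps / 1e6);
    printf("time budget : %.1f ns per packet\n", ns_per_pkt);
    printf("cycle budget: %.0f cycles per packet\n", cycles);
    return 0;
}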
Networking to Virtual Machines - PVP
9
[Diagram: PVP path - Traffic Generator <-> Physical Ports <-> vSwitch <-> Logical Ports <-> VM]
● Linux Bridge
● Open vSwitch (OVS)
● SR-IOV
● DPDK Enabled Open vSwitch (OVS-DPDK)
Networking to Virtual Machines
10
● Use the kernel datapath
● NAPI
● Unpredictable latency
● Not SDN ready
● Low throughput: ~1Mpps/core (Phy-to-Phy)
● qemu runs in userspace
Linux Bridge
11
● Use the kernel datapath
● NAPI
● Unpredictable latency
● SDN ready
● Low throughput: ~1Mpps/core
● qemu runs in userspace
Open vSwitch
12
● Low latency
● High throughput
● Bypass the host
● Not SDN friendly - Can’t use a virtual switch in the host
● Physical HW exposed - no abstraction, certification issues/costs
● Migration issues
● Limited number of devices
SR-IOV
13
What is DPDK?
● A set of libraries and drivers for fast packet processing.
● Open Source, BSD License
Usage:
● Receive and send packets within the minimum number of CPU cycles.
What it is not:
● A networking stack
Data Plane Development Kit (DPDK)
14
Consists of APIs, provided through the BSD driver running in userspace, to
configure the devices and their respective queues. In addition, a PMD
accesses the RX and TX descriptors directly without any interrupts to quickly
receive, process and deliver packets in the user’s application.
DPDK - Poll-Mode Drivers
Source: http://dpdk.org/doc/guides/prog_guide/poll_mode_drv.html
15
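A minimal sketch of the poll-mode pattern described above, using the DPDK burst RX/TX API (port and queue numbers are illustrative; EAL initialization, port setup and error handling are omitted, and the exact port-id types vary between DPDK releases):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Busy-poll one RX port/queue and forward every packet to a TX port. */
static void pmd_loop(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf *pkts[BURST_SIZE];

    for (;;) {
        /* Poll the RX descriptors directly - no interrupts involved. */
        uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, pkts, BURST_SIZE);

        if (nb_rx == 0)
            continue;

        /* Run-to-completion: the same batch is processed and sent. */
        uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, pkts, nb_rx);

        /* Free whatever the TX queue could not accept. */
        while (nb_tx < nb_rx)
            rte_pktmbuf_free(pkts[nb_tx++]);
    }
}

This loop is what a PMD thread spends its cycles on, which is why a core running it always shows 100% utilization.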
● Open vSwitch kernel module is just a cache managed by userspace.
● DPDK provides the libraries and drivers to RX/TX from userspace.
● Yeah, DPDK enabled Open vSwitch!
● Remember the 14.88Mpps? ~16Mpps/core Phys-to-Phys.
● Costs at least one core kept 100% busy running the PMD thread
(power consumption, cooling, wasted cycles)
Open vSwitch + DPDK
16
● Provide network connectivity to Virtual Machines
● Qemu runs in userspace
● Vhost-user interface (TX/RX shared virtqueues)
● Guests can choose between kernel or userspace
● Throughput: ~3.5Mpps/core (default features, PVP, tuned)
● Scales up linearly with multiple parallel streams
● System needs to be carefully tuned
OVS-DPDK for NFV
17
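As a rough idea of how such an OVS-DPDK + vhost-user setup is wired up with OVS 2.6-era commands (bridge and interface names and the PMD CPU mask are illustrative; exact options vary between releases):

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true        # initialize DPDK in ovs-vswitchd
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4      # pin PMD threads to dedicated cores
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev     # userspace datapath
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk   # physical DPDK port
ovs-vsctl add-port br0 vhost0 -- set Interface vhost0 type=dpdkvhostuser   # vhost-user port for the VM

qemu is then pointed at the vhost-user socket OVS creates for that port, which gives the guest its virtio queues.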
● Poll-Mode Driver thread owns a CPU
● Devices (queues) are distributed between PMD threads
● Each PMD thread will busy loop polling and processing
● Run-To-Completion
● Batching (reduce per packet processing cost)
How does it work?
18
X-Ray Patient: OVS-DPDK PMD Thread
19
[Diagram: one PMD thread polling Port 1 … Port n, feeding the forwarding plane, with a DROP path]
PMD in PVP
20
[Diagram: PMD in PVP - the PMD thread and forwarding plane sit between physical ports P1/P2 and logical ports L1/L2; Traffic Generator <-> PhysPorts <-> vSwitch <-> LogicPorts <-> VM]
Packet Flow
21
[Diagram: PMD and forwarding plane connecting PhysicalNIC ports (10, 11) and vhost-user ports (20, 21)]
Flows:
in_port=10,action=21
in_port=20,action=11
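Those two flows could be installed with ovs-ofctl along these lines (the bridge name br0 and the OpenFlow port numbers are illustrative):

ovs-ofctl add-flow br0 in_port=10,actions=output:21
ovs-ofctl add-flow br0 in_port=20,actions=output:11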
Measuring Throughput: Zero Packet Loss
22
Expected:
● Constant traffic rate
● System is constantly dropping packets
● Decrease traffic rate, repeat
Packet Drops: Aim For Weak Spots
23
[Diagram repeated from the previous slide: PMD and forwarding plane connecting PhysicalNIC ports (10, 11) and vhost-user ports (20, 21)]
Flows:
in_port=10,action=21
in_port=20,action=11
Packet Drops: NIC RX QUEUE
24
[Diagram: PhysicalNIC RX queue feeding the PMD and forwarding plane]
● Fixed size, limited by hardware
● Drops are reported in the port stats
● Queue overflow (producer-consumer problem)
Packet Drops: Vhost-user TX Queue
25
[Diagram: PMD and forwarding plane with the guest's vhost-user queue, showing a DROP path]
● Fixed size, limited in software
● Drops are reported in the guest
● Queue overflow (producer-consumer problem)
Packet Drops: Vhost-user RX Queue
26
[Diagram: PMD and forwarding plane with the guest's vhost-user RX queue]
● Fixed size, limited in software
● Drops are reported in the port stats
● Queue overflow (producer-consumer problem)
Measuring Throughput: Zero Packet Loss
27
Expected:
● Constant traffic rate
● System is constantly dropping packets
● Decrease traffic rate, repeat

Reality:
● System is stable for a period of time
● Few packets dropped sporadically
● Decrease traffic rate, repeat
● Very low throughput
● Understand what is causing the drops
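A sketch of the step-down search this procedure implies (the loss check is a stand-in: it simply pretends drops start above 3.5 Mpps, the PVP figure quoted earlier, so the example is self-contained):

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the real measurement: offer traffic at `mpps` for a
   fixed interval and report whether anything was dropped. */
static bool drops_at_rate(double mpps)
{
    return mpps > 3.5;   /* illustrative threshold only */
}

int main(void)
{
    double rate = 14.88;        /* start at 10G line rate, in Mpps */
    const double step = 0.1;

    /* Constant rate, check for loss, decrease, repeat. */
    while (rate > 0.0 && drops_at_rate(rate))
        rate -= step;

    printf("zero-loss throughput: %.2f Mpps\n", rate);
    return 0;
}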
Estimating PMD Processing Budget
28
Throughput (Mpps)   Proc. Budget (µs)   PMD Budget (µs)
3.0                 0.33                0.16
4.0                 0.25                0.12
5.0                 0.20                0.10
6.0                 0.16                0.08

(Proc. budget = 1 / throughput; the PMD budget is roughly half of that, since in the PVP path each packet passes through the single PMD thread twice.)
Measuring Polling/Processing cost.
29
Device       Mode                   Time (µs)
Phys         Ingress Polling        0.2
Phys         Ingress Processing     3.1
Phys         Egress Polling         0.016
Phys         Egress Processing      0
vhost-user   Ingress Polling        0.013
vhost-user   Ingress Processing     0
vhost-user   Egress Polling         0.73
vhost-user   Egress Processing      2.14
Total        Polling + Processing   6.2
● The total of 6.2µs is ~24x the per-packet budget (0.25µs)
● Assuming 32 packets per batch, the per-packet cost drops to 0.19µs, i.e. ~5Mpps
● 3.5Mpps with zero packet loss (0.29µs budget) => an average batch size of 21.4
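The same batching arithmetic, spelled out (the 6.2µs, 32-packet and 0.29µs figures are the ones quoted above):

#include <stdio.h>

int main(void)
{
    const double batch_cost_us = 6.2;   /* polling + processing per batch */

    /* A full 32-packet batch amortizes the cost to ~0.19 us per packet,
       which corresponds to roughly 5 Mpps. */
    printf("32-packet batch: %.2f us/pkt, %.1f Mpps\n",
           batch_cost_us / 32, 32 / batch_cost_us);

    /* At 3.5 Mpps zero-loss the per-packet budget is ~0.29 us, which
       implies an average batch of about 21 packets. */
    printf("0.29 us budget : avg batch of %.1f packets\n",
           batch_cost_us / 0.29);
    return 0;
}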
Batching
30
● Internal sources
● External sources
What is wasting time?
31
● What are they?
● How significant are they?
External Sources
32
● PMD Processing Budget (3Mpps): 0.16µs
● Ftrace tool => Kernel RCU callback: 50µs + preemption cost
● Roughly 8 batches
● rcu_nocbs=<cpu-list>, rcu_nocb_poll
External Interferences: RCU Callback
33
● nohz_full
● No way to get rid of it
External Interferences: Timer Interrupt
34
● Scheduling issues:
○ irqbalance off
○ isolcpus
● Watchdog: nowatchdog
● Power Management: processor.max_cstate=1
● Hyper Threading
● Real-Time Kernel
External Interferences: Other Sources
35
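Pulling the boot-time knobs from the last few slides together, the host kernel command line might include something like the following (the CPU list 2-9 is purely illustrative and has to match the cores reserved for PMD threads and vCPUs; irqbalance is disabled separately as a service):

isolcpus=2-9 nohz_full=2-9 rcu_nocbs=2-9 rcu_nocb_poll nowatchdog processor.max_cstate=1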
● Use DPDK L-Thread subsystem to isolate devices
● Disable mergeable buffers to increase batch sizes inside the guest
● Disable mergeable buffers to decrease per packet cost
● Increase OVS-DPDK batch size
● Increase NIC queue size
● Increase virtio ring size
● BIOS settings
● Hardware Offloading
● Faster platform/CPUs
● Improve CPU isolation in the kernel
Possible Improvements
36
Thank You
Questions & Answers
Source: http://dpdk.org/doc/guides/prog_guide/poll_mode_drv.html
37