Userspace networking

Networking in Userspace: Living on the edge
Stephen Hemminger [email protected]

Description

Seven years ago at LCA, Van Jacobson introduced the concept of net channels, but since then user-mode networking has not hit the mainstream. There are several different user-mode networking environments: Intel DPDK, BSD netmap, and Solarflare OpenOnload. Each of these provides higher performance than standard Linux kernel networking, but each also creates new problems. This talk will explore the issues created by userspace networking, including performance, internal architecture, security, and licensing.

Transcript of Userspace networking

Page 1: Userspace networking

Networking in Userspace: Living on the edge

Stephen Hemminger [email protected]

Page 2: Userspace networking

Problem Statement

[Chart: packets per second (bidirectional) vs. packet size in bytes, 0 to 20,000,000 pps; source: Intel DPDK Overview]


Page 3: Userspace networking

Server vs Infrastructure

Packet size             64 bytes         1024 bytes
Packets/second          14.88 million    1.2 million
Arrival rate            67.2 ns          835 ns
Clock cycles at 2 GHz   135              1670
Clock cycles at 3 GHz   201              2505

An L3 hit on an Intel® Xeon® costs ~40 cycles; an L3 miss (memory read) costs about 201 cycles at 3 GHz.
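As a back-of-the-envelope check on these figures (my arithmetic, not from the slides): a minimum-size frame occupies 64 + 20 bytes on the wire (preamble plus inter-frame gap), i.e. 672 bits, which at 10 Gbit/s takes 67.2 ns; 1 s / 67.2 ns ≈ 14.88 Mpps, and 67.2 ns at 2 GHz is about 134 clock cycles per packet.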

Page 4: Userspace networking

Traditional Linux networking

Page 5: Userspace networking
Page 6: Userspace networking

TCP Offload Engine

Page 7: Userspace networking

Good old sockets

Flexible, portable but slow

Page 8: Userspace networking

Memory mapped buffers

Efficient, but still constrained by architecture

Page 9: Userspace networking

Run in kernel

Page 10: Userspace networking

Slide 7

The OpenOnload architecture

● Network hardware provides a user-safe interface which can route Ethernet packets to an application context based on flow information contained within headers

[Diagram: OpenOnload architecture. The network adaptor DMAs packets either to the kernel context (network driver, protocol) or directly to application contexts (application, protocol, driver).]

No new protocols

Page 11: Userspace networking

Slide 8

The OpenOnload architecture

● Protocol processing can take place both in the application and kernel context for a given flow


Enables persistent / asynchronous processing

Maintains existing network control-plane

Page 12: Userspace networking

Slide 9

The OpenOnload architecture

● Protocol state is shared between the kernel and application contexts through a protected shared memory communications channel


Enables correct handling of protocol state with high performance

Page 13: Userspace networking

Slide 11

Performance metrics

● Overhead
  – Networking overheads take CPU time away from your application

● Latency
  – Holds your application up when it has nothing else to do
  – H/W + flight time + overhead

● Bandwidth
  – Dominates latency when messages are large
  – Limited by: algorithms, buffering and overhead

● Scalability
  – Determines how overhead grows as you add cores, memory, threads, sockets etc.

Page 14: Userspace networking

Slide 12

Anatomy of kernel-based networking

Page 15: Userspace networking

Slide 13

A user-level architecture?

Page 16: Userspace networking

Slide 14

Direct & safe hardware access

Page 17: Userspace networking

Slide 88

Some performance results

● Test platform: typical commodity server
  – Intel Clovertown 2.3 GHz quad-core Xeon (x1), 1.3 GHz FSB, 2 GB RAM
  – Intel 5000X chipset
  – Solarflare Solarstorm SFC4000 (B) controller, CX4
  – Back-to-back
  – RedHat Enterprise 5 (2.6.18-8.el5)

Page 18: Userspace networking

Slide 89

Performance: Latency and overhead

                ½ round-trip latency (µs)    CPU overhead (µs)
Hardware        4.2                          --
Kernel          11.2                         7.0
Onload          5.3                          1.1

● TCP ping-pong with 4 byte payload

● 70 byte frame: 14+20+20+12+4

Page 19: Userspace networking

Slide 92

Performance: Streaming bandwidth

Page 20: Userspace networking

Slide 93

Performance: UDP transmit

● Message rate: 4 byte UDP payload (46 byte frame)

               Onload        Kernel
  1 sender     2,030,000     473,000

Page 21: Userspace networking

Slide 94

Performance: UDP transmit

● Message rate: 4 byte UDP payload (46 byte frame)

               Onload        Kernel
  1 sender     2,030,000     473,000
  2 senders    3,880,000     532,000

Page 22: Userspace networking

Slide 95

Performance: UDP receive

Page 23: Userspace networking

Slide 100

OpenOnload Open Source

● OpenOnload available as Open Source (GPLv2)
  – Please contact us if you’re interested

● Compatible with x86 (ia32, amd64/em64t)

● Currently supports SMC10GPCIe-XFP and SMC10GPCIe-10BT NICs
  – Could support other user-accessible network interfaces

● Very interested in user feedback
  – On the technology and project directions

Page 24: Userspace networking

Netmap

http://info.iet.unipi.it/~luigi/netmap/

● BSD (and Linux port)

● Good scalability

● Libpcap emulation

Page 25: Userspace networking

Netmap

Page 26: Userspace networking

Netmap API

● Access
  – open("/dev/netmap")
  – ioctl(fd, NIOCREG, arg)
  – mmap(..., fd, 0) maps buffers and rings

● Transmit
  – fill up to avail buffers, starting from slot cur
  – ioctl(fd, NIOCTXSYNC) queues the packets

● Receive
  – ioctl(fd, NIOCRXSYNC) reports newly received packets
  – process up to avail buffers, starting from slot cur

These ioctl()s are non-blocking.
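To make the calls above concrete, here is a minimal transmit sketch in C. It follows the older netmap API these slides describe (cur/avail ring fields); current netmap releases use head/cur/tail instead, and the registration ioctl is spelled NIOCREGIF rather than the NIOCREG shown on the slide. Error handling is omitted and the interface name is just an example.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

static void tx_burst(const char *ifname, const void *pkt, unsigned len)
{
    struct nmreq req;
    int fd = open("/dev/netmap", O_RDWR);

    /* Bind the descriptor to the interface's rings. */
    memset(&req, 0, sizeof(req));
    strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);                 /* "NIOCREG" on the slide */

    /* Map the shared buffers and ring descriptors. */
    void *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
    struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);

    /* Fill up to 'avail' buffers, starting from slot 'cur'. */
    while (ring->avail > 0) {
        struct netmap_slot *slot = &ring->slot[ring->cur];
        memcpy(NETMAP_BUF(ring, slot->buf_idx), pkt, len);
        slot->len = len;
        ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
        ring->avail--;
    }

    /* Queue the filled slots for transmission; does not block. */
    ioctl(fd, NIOCTXSYNC, NULL);
}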

Page 27: Userspace networking

Netmap API: synchronization

● poll() and select(), what else!

– POLLIN and POLLOUT decide which sets of rings to work on

– work as expected, returning when avail>0

– interrupt mitigation delays are propagated up to the userspace process
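A matching receive sketch using poll() as described above; same caveats as the transmit sketch (old-style cur/avail API), and handle_packet() is a hypothetical callback standing in for application processing.

#include <poll.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Hypothetical application hook, not part of netmap. */
extern void handle_packet(const char *buf, unsigned len);

static void rx_loop(int fd, struct netmap_if *nifp)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        poll(&pfd, 1, -1);                      /* returns once avail > 0 */
        struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);

        /* Process up to 'avail' buffers, starting from slot 'cur'. */
        while (ring->avail > 0) {
            struct netmap_slot *slot = &ring->slot[ring->cur];
            handle_packet(NETMAP_BUF(ring, slot->buf_idx), slot->len);
            ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
            ring->avail--;
        }
    }
}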

Page 28: Userspace networking

Netmap: multiqueue

● Of course.

– one netmap ring per physical ring

– by default, the fd is bound to all rings

– ioctl(fd, NIOCREG, arg) can restrict the binding to a single ring pair

– multiple fd's can be bound to different rings on the same card

– the fd's can be managed by different threads

– threads mapped to cores with pthread_setaffinity()
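A small sketch of the thread-to-core pinning mentioned in the last point; the glibc function is actually named pthread_setaffinity_np():

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread (which owns one ring pair) to a given core. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}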

Page 29: Userspace networking

Netmap and the host stack

● While in netmap mode, the control path remains unchanged:

– ifconfig, ioctl's, etc still work as usual

– the OS still believes the interface is there

● The data path is detached from the host stack:

– packets from NIC end up in RX netmap rings

– packets from TX netmap rings are sent to the NIC

● The host stack is attached to an extra pair of netmap rings:

– packets from the host go to a SW RX netmap ring

– packets from a SW TX netmap ring are sent to the host

– these rings are managed using the netmap API
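For illustration, binding a descriptor to the host-stack (software) ring pair instead of the hardware rings looks like this under the old netmap API (same assumptions and headers as the earlier sketches; the interface name is an example):

struct nmreq req;
memset(&req, 0, sizeof(req));
strncpy(req.nr_name, "em0", sizeof(req.nr_name) - 1);
req.nr_ringid = NETMAP_SW_RING;      /* the SW rings facing the host stack */
ioctl(fd, NIOCREGIF, &req);          /* "NIOCREG" on the slides */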

Page 30: Userspace networking

Netmap: Tx performance

Page 31: Userspace networking

Netmap: Rx Performance

Page 32: Userspace networking

Netmap Summary

Packet forwarding          Mpps
FreeBSD bridging           0.690
Netmap + libpcap           7.500
Netmap                     14.88

Open vSwitch               Mpps
userspace                  0.065
Linux                      0.600
FreeBSD                    0.790
FreeBSD + netmap/pcap      3.050

Page 33: Userspace networking

Intel DPDK Architecture

Page 34: Userspace networking


The Intel® DPDK Philosophy

• Must run on any IA CPU
  ‒ From Intel® Atom™ processors to the latest Intel® Xeon® processor family
  ‒ Essential to the IA value proposition
• Focus on the fast path
  ‒ Sending large numbers of packets to the Linux kernel/GPOS will bog the system down
• Provide software examples that address common network performance deficits
  ‒ Best practices for software architecture
  ‒ Tips for data structure design and storage
  ‒ Help the compiler generate optimum code
  ‒ Address the challenges of achieving 80 Mpps per CPU socket

Intel® DPDK Fundamentals
• Implements a run-to-completion model or pipeline model
• No scheduler: all devices accessed by polling
• Supports 32-bit and 64-bit, with/without NUMA
• Scales from Intel® Atom™ to Intel® Xeon® processors
• Number of cores and processors not limited
• Optimal packet allocation across DRAM channels

Page 35: Userspace networking


Intel® DPDK Libraries

Intel® Data Plane Development Kit (Intel® DPDK) embeds optimizations for the IA platform:
  – Data plane libraries and optimized NIC drivers in Linux user space
  – Run-time environment
  – Environment Abstraction Layer and boot code
  – BSD-licensed, source downloadable from Intel and leading eco-partners

[Diagram: customer applications and the DPDK libraries (buffer management, queue/ring functions, NIC poll-mode library, packet flow classification) run on the Environment Abstraction Layer in user space, above the Linux kernel and platform hardware]

Page 36: Userspace networking


Intel® DPDK Libraries and Drivers

• Memory Manager: Responsible for allocating pools of objects in memory. A pool is created in huge page memory space and uses a ring to store free objects. It also provides an alignment helper to ensure that objects are padded to spread them equally on all DRAM channels.

• Buffer Manager: Significantly reduces the time the operating system spends allocating and de-allocating buffers. The Intel® DPDK pre-allocates fixed-size buffers which are stored in memory pools.

• Queue Manager: Implements safe lockless queues, instead of using spinlocks, that allow different software components to process packets while avoiding unnecessary wait times.

• Flow Classification: Provides an efficient mechanism which incorporates Intel® Streaming SIMD Extensions (Intel® SSE) to produce a hash based on tuple information so that packets may be placed into flows quickly for processing, thus greatly improving throughput.

• Poll Mode Drivers: The Intel® DPDK includes Poll Mode Drivers for 1 GbE and 10 GbE Ethernet* controllers which are designed to work without asynchronous, interrupt-based signaling mechanisms, which greatly speeds up the packet pipeline.
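To show how these pieces fit together, here is a minimal poll-mode receive loop written against a recent DPDK release (the API has changed considerably since the version current at the time of this talk); the pool name, pool sizes, and port number are arbitrary illustration values, and error handling is omitted.

#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NUM_MBUFS   8191      /* arbitrary pool size for the example */
#define BURST_SIZE  32

int main(int argc, char **argv)
{
    uint16_t port = 0;                         /* first probed ethdev */
    struct rte_eth_conf port_conf = { 0 };

    rte_eal_init(argc, argv);                  /* hugepages, device probe */

    /* Buffer manager: fixed-size mbufs pre-allocated in a huge-page pool. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL",
        NUM_MBUFS, 250, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    /* One RX queue, no TX queues, default port configuration. */
    rte_eth_dev_configure(port, 1, 0, &port_conf);
    rte_eth_rx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port),
                           NULL, pool);
    rte_eth_dev_start(port);

    /* Poll-mode driver: no interrupts, just burst reads in a tight loop. */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);         /* a real app would process here */
    }
    return 0;
}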

Page 37: Userspace networking


Intel® DPDK Native and Virtualized Forwarding Performance

Page 38: Userspace networking

Comparison

              Netmap            DPDK            OpenOnload
License       BSD               BSD             GPL
API           Packet + pcap     Packet + lib    Sockets
Kernel        Yes               Yes             Yes
HW support    Intel, Realtek    Intel           Solarflare
OS            FreeBSD, Linux    Linux           Linux

Page 39: Userspace networking

Issues

● Out-of-tree kernel code

– Non-standard drivers

● Resource sharing

– CPU

– NIC

● Security

– No firewall

– DMA isolation

Page 40: Userspace networking

What's needed?

● Netmap

– Linux version (not port)

– Higher level protocols?

● DPDK

– Wider device support

– Ask Intel

● OpenOnload

– Ask Solarflare

Page 41: Userspace networking

● OpenOnload

– A user-level network stack (Google tech talk)
  ● Steve Pope
  ● David Riddoch

● Netmap - Luigi Rizzo

– http://info.iet.unipi.it/~luigi/netmap/talk-atc12.html

● DPDK

– Intel DPDK Overview

– Disruptive IP networking
  ● Naoto MATSUMOTO

Page 42: Userspace networking

Thank you