Userspace networking

Networking in Userspace: Living on the edge
Stephen Hemminger [email protected]

Description

Seven years ago at LCA, Van Jacobson introduced the concept of net channels, but since then user-mode networking has not hit the mainstream. There are several different user-mode networking environments: Intel DPDK, BSD netmap, and Solarflare OpenOnload. Each of these provides higher performance than standard Linux kernel networking, but each also creates new problems. This talk will explore the issues created by userspace networking, including performance, internal architecture, security, and licensing.

Transcript of Userspace networking

Page 1: Userspace networking

Networking in Userspace: Living on the edge

Stephen Hemminger [email protected]

Page 2: Userspace networking

Problem Statement

[Chart: packets per second (bidirectional) vs. packet size in bytes, 0 to 20,000,000 pps; source: Intel DPDK Overview]


Page 3: Userspace networking

Server vs Infrastructure

Packet size             64 bytes         1024 bytes
Packets/second          14.88 million    1.2 million
Arrival rate            67.2 ns          835 ns
Clock cycles at 2 GHz   135              1670
Clock cycles at 3 GHz   201              2505

An L3 hit on an Intel® Xeon® costs ~40 cycles; an L3 miss (memory read) costs about 201 cycles at 3 GHz.
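As a back-of-the-envelope check on these figures (my arithmetic, not from the slides): a minimum-size frame occupies 64 + 20 bytes on the wire (preamble plus inter-frame gap), i.e. 672 bits, which at 10 Gbit/s takes 67.2 ns; 1 s / 67.2 ns ≈ 14.88 Mpps, and 67.2 ns at 2 GHz is about 134 clock cycles per packet.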

Page 4: Userspace networking

Traditional Linux networking

Page 5: Userspace networking
Page 6: Userspace networking

TCP Offload Engine

Page 7: Userspace networking

Good old sockets

Flexible, portable but slow

Page 8: Userspace networking

Memory mapped buffers

Efficient, but still constrained by architecture

Page 9: Userspace networking

Run in kernel

Page 10: Userspace networking

Slide 7

The OpenOnload architecture

● Network hardware provides a user-safe interface which can route Ethernet packets to an application context based on flow information contained within headers

[Diagram: OpenOnload architecture. The network adaptor DMAs packets either to the kernel context (network driver, protocol) or directly to application contexts (application, protocol, driver).]

No new protocols

Page 11: Userspace networking

Slide 8

The OpenOnload architecture

● Protocol processing can take place both in the application and kernel context for a given flow


Enables persistent / asynchronous processing

Maintains existing network control-plane

Page 12: Userspace networking

Slide 9

The OpenOnload architecture

● Protocol state is shared between the kernel and application contexts through a protected shared memory communications channel


Enables correct handling of protocol state with high performance

Page 13: Userspace networking

Slide 11

Performance metrics

● Overhead
  – Networking overheads take CPU time away from your application

● Latency
  – Holds your application up when it has nothing else to do
  – H/W + flight time + overhead

● Bandwidth
  – Dominates latency when messages are large
  – Limited by: algorithms, buffering and overhead

● Scalability
  – Determines how overhead grows as you add cores, memory, threads, sockets etc.

Page 14: Userspace networking

Slide 12

Anatomy of kernel-based networking

Page 15: Userspace networking

Slide 13

A user-level architecture?

Page 16: Userspace networking

Slide 14

Direct & safe hardware access

Page 17: Userspace networking

Slide 88

Some performance results

● Test platform: typical commodity server
  – Intel Clovertown 2.3 GHz quad-core Xeon (x1), 1.3 GHz FSB, 2 GB RAM
  – Intel 5000X chipset
  – Solarflare Solarstorm SFC4000 (B) controller, CX4
  – Back-to-back
  – RedHat Enterprise 5 (2.6.18-8.el5)

Page 18: Userspace networking

Slide 89

Performance: Latency and overhead

                ½ round-trip latency (µs)    CPU overhead (µs)
Hardware        4.2                          --
Kernel          11.2                         7.0
Onload          5.3                          1.1

● TCP ping-pong with 4 byte payload

● 70 byte frame: 14+20+20+12+4

Page 19: Userspace networking

Slide 92

Performance: Streaming bandwidth

Page 20: Userspace networking

Slide 93

Performance: UDP transmit

● Message rate: 4 byte UDP payload (46 byte frame)

               Onload        Kernel
  1 sender     2,030,000     473,000

Page 21: Userspace networking

Slide 94

Performance: UDP transmit

● Message rate: 4 byte UDP payload (46 byte frame)

               Onload        Kernel
  1 sender     2,030,000     473,000
  2 senders    3,880,000     532,000

Page 22: Userspace networking

Slide 95

Performance: UDP receive

Page 23: Userspace networking

Slide 100

OpenOnload Open Source

● OpenOnload available as Open Source (GPLv2)
  – Please contact us if you’re interested

● Compatible with x86 (ia32, amd64/em64t)

● Currently supports SMC10GPCIe-XFP and SMC10GPCIe-10BT NICs
  – Could support other user-accessible network interfaces

● Very interested in user feedback
  – On the technology and project directions

Page 24: Userspace networking

Netmap

http://info.iet.unipi.it/~luigi/netmap/

● BSD (and Linux port)

● Good scalability

● Libpcap emulation

Page 25: Userspace networking

Netmap

Page 26: Userspace networking

Netmap API

● Access
  – open("/dev/netmap")
  – ioctl(fd, NIOCREG, arg)
  – mmap(..., fd, 0) maps buffers and rings

● Transmit
  – fill up to avail buffers, starting from slot cur
  – ioctl(fd, NIOCTXSYNC) queues the packets

● Receive
  – ioctl(fd, NIOCRXSYNC) reports newly received packets
  – process up to avail buffers, starting from slot cur

These ioctl()s are non-blocking.
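To make the calls above concrete, here is a minimal transmit sketch in C. It follows the older netmap API these slides describe (cur/avail ring fields); current netmap releases use head/cur/tail instead, and the registration ioctl is spelled NIOCREGIF rather than the NIOCREG shown on the slide. Error handling is omitted and the interface name is just an example.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

static void tx_burst(const char *ifname, const void *pkt, unsigned len)
{
    struct nmreq req;
    int fd = open("/dev/netmap", O_RDWR);

    /* Bind the descriptor to the interface's rings. */
    memset(&req, 0, sizeof(req));
    strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);                 /* "NIOCREG" on the slide */

    /* Map the shared buffers and ring descriptors. */
    void *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
    struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);

    /* Fill up to 'avail' buffers, starting from slot 'cur'. */
    while (ring->avail > 0) {
        struct netmap_slot *slot = &ring->slot[ring->cur];
        memcpy(NETMAP_BUF(ring, slot->buf_idx), pkt, len);
        slot->len = len;
        ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
        ring->avail--;
    }

    /* Queue the filled slots for transmission; does not block. */
    ioctl(fd, NIOCTXSYNC, NULL);
}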

Page 27: Userspace networking

Netmap API: synchronization

● poll() and select(), what else!

– POLLIN and POLLOUT decide which sets of rings to work on

– work as expected, returning when avail>0

– interrupt mitigation delays are propagated up to the userspace process
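A matching receive sketch using poll() as described above; same caveats as the transmit sketch (old-style cur/avail API), and handle_packet() is a hypothetical callback standing in for application processing.

#include <poll.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Hypothetical application hook, not part of netmap. */
extern void handle_packet(const char *buf, unsigned len);

static void rx_loop(int fd, struct netmap_if *nifp)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        poll(&pfd, 1, -1);                      /* returns once avail > 0 */
        struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);

        /* Process up to 'avail' buffers, starting from slot 'cur'. */
        while (ring->avail > 0) {
            struct netmap_slot *slot = &ring->slot[ring->cur];
            handle_packet(NETMAP_BUF(ring, slot->buf_idx), slot->len);
            ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
            ring->avail--;
        }
    }
}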

Page 28: Userspace networking

Netmap: multiqueue

● Of course.

– one netmap ring per physical ring

– by default, the fd is bound to all rings

– ioctl(fd, NIOCREG, arg) can restrict the binding to a single ring pair

– multiple fd's can be bound to different rings on the same card

– the fd's can be managed by different threads

– threads mapped to cores with pthread_setaffinity()
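A small sketch of the thread-to-core pinning mentioned in the last point; the glibc function is actually named pthread_setaffinity_np():

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread (which owns one ring pair) to a given core. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}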

Page 29: Userspace networking

Netmap and the host stack

● While in netmap mode, the control path remains unchanged:

– ifconfig, ioctl's, etc still work as usual

– the OS still believes the interface is there

● The data path is detached from the host stack:

– packets from NIC end up in RX netmap rings

– packets from TX netmap rings are sent to the NIC

● The host stack is attached to an extra pair of netmap rings:

– packets from the host go to a SW RX netmap ring

– packets from a SW TX netmap ring are sent to the host

– these rings are managed using the netmap API
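For illustration, binding a descriptor to the host-stack (software) ring pair instead of the hardware rings looks like this under the old netmap API (same assumptions and headers as the earlier sketches; the interface name is an example):

struct nmreq req;
memset(&req, 0, sizeof(req));
strncpy(req.nr_name, "em0", sizeof(req.nr_name) - 1);
req.nr_ringid = NETMAP_SW_RING;      /* the SW rings facing the host stack */
ioctl(fd, NIOCREGIF, &req);          /* "NIOCREG" on the slides */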

Page 30: Userspace networking

Netmap: Tx performance

Page 31: Userspace networking

Netmap: Rx Performance

Page 32: Userspace networking

Netmap Summary

Packet forwarding          Mpps
FreeBSD bridging           0.690
Netmap + libpcap           7.500
Netmap                     14.88

Open vSwitch               Mpps
userspace                  0.065
Linux                      0.600
FreeBSD                    0.790
FreeBSD + netmap/pcap      3.050

Page 33: Userspace networking

Intel DPDK Architecture

Page 34: Userspace networking


The Intel® DPDK Philosophy

• Must run on any IA CPU
  ‒ From Intel® Atom™ processors to the latest Intel® Xeon® processor family
  ‒ Essential to the IA value proposition
• Focus on the fast path
  ‒ Sending large numbers of packets to the Linux kernel/GPOS will bog the system down
• Provide software examples that address common network performance deficits
  ‒ Best practices for software architecture
  ‒ Tips for data structure design and storage
  ‒ Help the compiler generate optimum code
  ‒ Address the challenges of achieving 80 Mpps per CPU socket

Intel® DPDK Fundamentals
• Implements a run-to-completion model or pipeline model
• No scheduler: all devices accessed by polling
• Supports 32-bit and 64-bit, with/without NUMA
• Scales from Intel® Atom™ to Intel® Xeon® processors
• Number of cores and processors not limited
• Optimal packet allocation across DRAM channels

Page 35: Userspace networking


Intel® DPDK Libraries

Intel® Data Plane Development Kit (Intel® DPDK) embeds optimizations for the IA platform:
  – Data plane libraries and optimized NIC drivers in Linux user space
  – Run-time environment
  – Environment Abstraction Layer and boot code
  – BSD-licensed, source downloadable from Intel and leading eco-partners

[Diagram: customer applications and the DPDK libraries (buffer management, queue/ring functions, NIC poll-mode library, packet flow classification) run on the Environment Abstraction Layer in user space, above the Linux kernel and platform hardware]

Page 36: Userspace networking


Intel® DPDK Libraries and Drivers

• Memory Manager: Responsible for allocating pools of objects in memory. A pool is created in huge page memory space and uses a ring to store free objects. It also provides an alignment helper to ensure that objects are padded to spread them equally on all DRAM channels.

• Buffer Manager: Significantly reduces the time the operating system spends allocating and de-allocating buffers. The Intel® DPDK pre-allocates fixed-size buffers which are stored in memory pools.

• Queue Manager: Implements safe lockless queues, instead of using spinlocks, that allow different software components to process packets while avoiding unnecessary wait times.

• Flow Classification: Provides an efficient mechanism which incorporates Intel® Streaming SIMD Extensions (Intel® SSE) to produce a hash based on tuple information so that packets may be placed into flows quickly for processing, thus greatly improving throughput.

• Poll Mode Drivers: The Intel® DPDK includes Poll Mode Drivers for 1 GbE and 10 GbE Ethernet* controllers which are designed to work without asynchronous, interrupt-based signaling mechanisms, which greatly speeds up the packet pipeline.
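To show how these pieces fit together, here is a minimal poll-mode receive loop written against a recent DPDK release (the API has changed considerably since the version current at the time of this talk); the pool name, pool sizes, and port number are arbitrary illustration values, and error handling is omitted.

#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NUM_MBUFS   8191      /* arbitrary pool size for the example */
#define BURST_SIZE  32

int main(int argc, char **argv)
{
    uint16_t port = 0;                         /* first probed ethdev */
    struct rte_eth_conf port_conf = { 0 };

    rte_eal_init(argc, argv);                  /* hugepages, device probe */

    /* Buffer manager: fixed-size mbufs pre-allocated in a huge-page pool. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL",
        NUM_MBUFS, 250, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    /* One RX queue, no TX queues, default port configuration. */
    rte_eth_dev_configure(port, 1, 0, &port_conf);
    rte_eth_rx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port),
                           NULL, pool);
    rte_eth_dev_start(port);

    /* Poll-mode driver: no interrupts, just burst reads in a tight loop. */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);         /* a real app would process here */
    }
    return 0;
}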

Page 37: Userspace networking


Intel® DPDK Native and Virtualized Forwarding Performance

Page 38: Userspace networking

Comparison

              Netmap            DPDK            OpenOnload
License       BSD               BSD             GPL
API           Packet + pcap     Packet + lib    Sockets
Kernel        Yes               Yes             Yes
HW support    Intel, Realtek    Intel           Solarflare
OS            FreeBSD, Linux    Linux           Linux

Page 39: Userspace networking

Issues

● Out-of-tree kernel code

– Non-standard drivers

● Resource sharing

– CPU

– NIC

● Security

– No firewall

– DMA isolation

Page 40: Userspace networking

What's needed?

● Netmap

– Linux version (not port)

– Higher level protocols?

● DPDK

– Wider device support

– Ask Intel

● OpenOnload

– Ask Solarflare

Page 41: Userspace networking

● OpenOnload

– A user-level network stack (Google tech talk)
  ● Steve Pope
  ● David Riddoch

● Netmap - Luigi Rizzo

– http://info.iet.unipi.it/~luigi/netmap/talk-atc12.html

● DPDK

– Intel DPDK Overview

– Disruptive IP networking
  ● Naoto MATSUMOTO

Page 42: Userspace networking

Thank you