Understanding DPDK

Understanding

DPDKDescription of techniques used to achieve

high throughput on a commodity hardware

How fast SW has to work?

14.88 millions of 64 byte packets per second on 10G interface

1.8 GHz -> 1 cycle = 0,55 ns

1 packet -> 67.2 ns = 120 clock cycles

IFGPream

MACType Payload CRC

84 Bytes

412 8 60

Comparative speed values

CPU to memory speed = 6-8 GBytes/s

PCI-Express x16 speed = 5 GBytes/s

Access to RAM = 200 ns

Access to L3 cache = 4 ns

Context switch ~= 1000 ns (3.2 GHz)

Packet processing in Linux

User space

Kernel space

Driver

RX/TX queues

SocketRing

buffers

Linux kernel overhead

System calls

Context switching on blocking I/O

Data copying from kernel to user space

Interrupt handling in kernel

Expense of sendto

Function Activity Time (ns)

sendto system call 96

sosend_dgram lock sock_buff, alloc mbuf, copy in 137

udp_output UDP header setup 57

ip_output route lookup, ip header setup 198

ether_otput MAC lookup, MAC header setup 162

ixgbe_xmit device programming 220

Total 950

Packet processing with DPDK

User space

Kernel space

App DPDKRing

buffers

UIO driver

queues

Kernel space

Updating a register in Linux

User space

ioctl()

Register

syscall

copy_from_user()

iowrite()

Updating a register with DPDK

User space

assign

Register

What is used inside DPDK?

Processor affinity (separate cores)

Huge pages (no swap, TLB)

UIO (no copying from kernel)

Polling (no interrupts overhead)

Lockless synchronization (avoid waiting)

Batch packets handling

SSE, NUMA awareness

Linux default scheduling

Core 0

Core 1

Core 2

Core 3

t1 t4t3t2

How to isolate a core for a process

To diagnose use top“top” , press “f” , press “j”

Before boot use isolcpus“isolcpus=2,4,6”

After boot - use cpuset“cset shield -c 1-3”, “cset shield -k on”

Core 2Core 1

Run-to-completion model

thread

Port 1 Port 2

Core 2Core 1

Pipeline model

thread

Port 1 Port 2

Page tables tree

Linux paging model

Global

Directory

TLBPage

OffsetVirtual page

Physical Page Offset

TLB characteristics

$ cpuid | grep -i tlb

size: 12–4,096 entries

hit time: 0.5–1 clock cycle

miss penalty: 10–100 clock cycles

miss rate: 0.01–1%

It is very expensive resource!

Solution - Hugepages

Benefit: optimized TLB usage, no swap

Hugepage size = 2M

Usage:

mount hugetlbfs /mnt/huge

Library - libhugetlbfs

Lockless ring design

Writer can preempt writer and reader

Reader can not preempt writer

Reader and writer can work simultaneously on

different cores

Barrier

CAS operation

Bulk queue/dequeue

Lockless ring (Single Producer)

cons_head

cons_tail

prod_head

prod_tail

prod_next 2

cons_head

cons_tail

prod_head

prod_next

prod_tail

cons_head

cons_tail

prod_head

prod_tail

Lockless ring (Single Consumer)

cons_head

cons_tail

prod_head

prod_tail

cons_next 2

cons_tail prod_head

prod_tail

cons_next

cons_head

cons_tail

prod_head

prod_tail

Lockless ring (Multiple Producers)

cons_head

cons_tail

prod_head

prod_tail

prod_next1

prod_next2 3

cons_head

cons_tail

prod_head

cons_head

cons_tail

prod_head

prod_next2

prod_tail

prod_next1

cons_head

cons_tail

cons_head

cons_tail

prod_head

prod_tail

prod_headprod_tail

prod_next1

prod_next2

prod_next1

prod_next2

Kernel space network driver

IP stack

Driver

Config

User space

Kernel space

Interrupts

“The most important devices can’t be handled

in user space, including, but not

limited to, network interfaces and block

devices.” - LDD3

User space

Kernel space

Interfacesysfs /dev/uioX

US driver epoll()

mmap()

UIO framework

driver

NIC User space

Access to device from user space

BAR0 (Mem)

BAR2 (IO)

Vendor Id

Device Id

Command

Revision Id

Status

Configuration

registers

I/O and memory

regions

/sys/class/uio/uioX/maps/mapX

/sys/class/uio/uioX/portio/portX

/dev/uioX -> mmap (offset)

/sys/bus/pci/devices

Host memory NIC memory

DMA RX

Update RDT

DMA descriptor(s)

RX queue RX FIFO

DMA packet

Descriptor ringMemory

DMA descriptors

Host memory NIC memory

DMA TX

Update TDT

DMA descriptor(s)

TX queue TX FIFO

DMA packet

Descriptor ringMemory

DMA descriptors

Receive from SW side

DD DD DDDD

DDmbuf1

addrDD

RDH = 1

RDT = 5

RDBA = 0

RDLEN = 6

mbuf1RDH

Transmit from SW side

DD DD DDDD

DDmbuf1

addrDD

TDH = 1

TDT = 5

TDBA = 0

TDLEN = 6

mbuf1TDH

I/O controller

PCI-E PCI-E

I/O controller

PCI-E PCI-E

Socket 0 Socket 1

RSS (Receive Side Scaling)

functionQueue 0 CPU N

Queue N

Incoming traffic Indirection

Flow director

Queue 0 CPU N

Queue N

Incoming trafficFilter table

function

Outgoing traffic

Drop Route

Virtualization - SR-IOV

VF driver

PF driver

Virtual bridge

Slow path using bifurcated driver

Kernel DPDK

Virtual bridge

PF Filter table

Slow path using TAP

User space

Kernel space

App DPDKRing

buffers

TAP device

queues

TCP/IP

Slow path using KNI

User space

Kernel space

App DPDKRing

buffers

KNI device

queues

TCP/IP

x86 HW

Application 1 - Traffic generator

User space

Streams generator

Traffic analyzer

x86 HW

Application 2 - Router

Kernel

User space

Routing table

Routing table cacheDUT1 DUT2

x86 HW

Application 3 - Middlebox

User space

DPIDUT1 DUT2

References

Device Drivers in User Space

Userspace I/O drivers in a realtime context

The Userspace I/O HOWTO

The anatomy of a PCI/PCI Express kernel driver

From Intel® Data Plane Development Kit to Wind River Network Acceleration

Platform

DPDK Design Tips (Part 1 - RSS)

Getting the Best of Both Worlds with Queue Splitting (Bifurcated Driver)

Design considerations for efficient network applications with Intel® multi-core

processor-based systems on Linux

Introduction to Intel Ethernet Flow Director

My blog

Learning Network Programming

Understanding DPDK

Software

Transcript of Understanding DPDK

Improving DPDK Performance - Netcope

DPDK Summit China 2017 · DPDK Cryptodev Framework Crypto framework for processing symmetric crypto workloads in DPDK. DPDK Cryptodev consists of: SW and HW Crypto PMDs A standard

DPDK-in-a-Box · 2020. 3. 21. · DPDK-in-a-Box David Hunt - Intel DPDK US Summit - San Jose - 2016

Intel Dpdk API Reference

DPDK summit 2015: It's kind of fun to do the impossible with DPDK

DPDK ARCHITECTURE AND ROADMAP DISCUSSIONstatic.dpdk.org/events/slides/DPDK-2017-04-Architecture...Agenda Key trends in network transformation DPDK role DPDK Architecture Multi Architecture

Dpdk accelerated Ostinato

DPDK Summit China 2017 · Network Platforms Group DPDK China Summit 2017 ，Shanghai 8 vDPA: DPDK High Level Design DPDK vhost-user library CP-Protocol, communicate channel with QEMU

DPDK Summit - 08 Sept 2014 - VMware and Intel - Using DPDK In A Virtual World

Understanding The Performance of DPDK as a Computer · PDF fileUnderstanding The Performance of DPDK as a Computer Architect DPDK US Summit - San Jose - 2016 XIAOBAN WU *, PEILONG

DPDK Test Suite - readthedocs.org

DPDK, VPP/FD.io and

DPDK Test Suite

DPDK, VPP & pfSense 3 VPP & pfSense 3.0 Jim Thompson DPDK Summit Userspace - Dublin- 2017

Dpdk Sample Apps 1.7.0

Putting DPDK in production€¦ · Putting DPDK in production DPDK users perspective Franck Baudin, Principal Product Manager, OpenStack NFV Anita Tragler, Senior Product Manager,

DPDK on Azure · Ubuntu, SLES, RHEL, CentOS ... – Experimental Kernel 4.18 DPDK 18.11 – Device Assignment – Vswitch or SR-IOV. DPDK on Azure Roadmap 2017

DPDK packet capture

Chelsio DPDK Driver for Linuxservice.chelsio.com/beta/drivers/Chelsio-DPDK-2.0.0.0/Chelsio-DPDK... · Chapter 1. Introduction Chelsio DPDK Driver for Linux 5 1. Introduction Thank

Introduction - Xilinx...DPDK Poll Mode Driver The Xilinx reference QDMA DPDK driver is based on DPDK v17.11.1. The DPDK driver is tested by binding the PCIe functions with the igb_uio