Understanding DPDK

Post on 15-Jul-2015

2.598 views 30 download

Tags:

Transcript of Understanding DPDK

Understanding

DPDKDescription of techniques used to achieve

high throughput on a commodity hardware

How fast SW has to work?

14.88 millions of 64 byte packets per second on 10G interface

1.8 GHz -> 1 cycle = 0,55 ns

1 packet -> 67.2 ns = 120 clock cycles

IFGPream

ble

DST

MAC

SRC

MAC

SRC

MACType Payload CRC

84 Bytes

412 8 60

Comparative speed values

CPU to memory speed = 6-8 GBytes/s

PCI-Express x16 speed = 5 GBytes/s

Access to RAM = 200 ns

Access to L3 cache = 4 ns

Context switch ~= 1000 ns (3.2 GHz)

Packet processing in Linux

User space

Kernel space

NIC

App

Driver

RX/TX queues

SocketRing

buffers

Linux kernel overhead

System calls

Context switching on blocking I/O

Data copying from kernel to user space

Interrupt handling in kernel

Expense of sendto

Function Activity Time (ns)

sendto system call 96

sosend_dgram lock sock_buff, alloc mbuf, copy in 137

udp_output UDP header setup 57

ip_output route lookup, ip header setup 198

ether_otput MAC lookup, MAC header setup 162

ixgbe_xmit device programming 220

Total 950

Packet processing with DPDK

User space

Kernel space

NIC

App DPDKRing

buffers

UIO driver

RX/TX

queues

Kernel space

Updating a register in Linux

User space

HW

ioctl()

Register

syscall

VFS

copy_from_user()

iowrite()

Updating a register with DPDK

User space

HW

assign

Register

What is used inside DPDK?

Processor affinity (separate cores)

Huge pages (no swap, TLB)

UIO (no copying from kernel)

Polling (no interrupts overhead)

Lockless synchronization (avoid waiting)

Batch packets handling

SSE, NUMA awareness

Linux default scheduling

Core 0

Core 1

Core 2

Core 3

t1 t4t3t2

How to isolate a core for a process

To diagnose use top“top” , press “f” , press “j”

Before boot use isolcpus“isolcpus=2,4,6”

After boot - use cpuset“cset shield -c 1-3”, “cset shield -k on”

Core 2Core 1

Run-to-completion model

RX/TX

thread

RX/TX

thread

Port 1 Port 2

Core 2Core 1

Pipeline model

RX

thread

TX

thread

Port 1 Port 2

Ring

Page tables tree

Linux paging model

cr3

Page

Page

Global

Directory

Page

Table

Page

Middle

Directory

TLB

TLBPage

Table

RAM

OffsetVirtual page

Physical Page Offset

TLB characteristics

$ cpuid | grep -i tlb

size: 12–4,096 entries

hit time: 0.5–1 clock cycle

miss penalty: 10–100 clock cycles

miss rate: 0.01–1%

It is very expensive resource!

Solution - Hugepages

Benefit: optimized TLB usage, no swap

Hugepage size = 2M

Usage:

mount hugetlbfs /mnt/huge

mmap

Library - libhugetlbfs

Lockless ring design

Writer can preempt writer and reader

Reader can not preempt writer

Reader and writer can work simultaneously on

different cores

Barrier

CAS operation

Bulk queue/dequeue

Lockless ring (Single Producer)

1

cons_head

cons_tail

prod_head

prod_tail

prod_next 2

cons_head

cons_tail

prod_head

prod_next

prod_tail

3

cons_head

cons_tail

prod_head

prod_tail

Lockless ring (Single Consumer)

1

cons_head

cons_tail

prod_head

prod_tail

cons_next 2

cons_tail prod_head

prod_tail

cons_next

cons_head

3

cons_head

cons_tail

prod_head

prod_tail

Lockless ring (Multiple Producers)

1

cons_head

cons_tail

prod_head

prod_tail

prod_next1

prod_next2 3

cons_head

cons_tail

prod_head

2

cons_head

cons_tail

prod_head

prod_next2

prod_tail

prod_next1

4

cons_head

cons_tail

5

cons_head

cons_tail

prod_head

prod_tail

prod_tail

prod_headprod_tail

prod_next1

prod_next2

prod_next1

prod_next2

Kernel space network driver

App

IP stack

Driver

NIC

Data

Desc

Config

Data

User space

Kernel space

Interrupts

UIO

“The most important devices can’t be handled

in user space, including, but not

limited to, network interfaces and block

devices.” - LDD3

UIO

User space

Kernel space

Interfacesysfs /dev/uioX

App

US driver epoll()

mmap()

UIO framework

driver

NIC User space

Access to device from user space

BAR0 (Mem)

BAR1

BAR2 (IO)

BAR5

BAR4

BAR3

Vendor Id

Device Id

Command

Revision Id

Status

...

Configuration

registers

I/O and memory

regions

/sys/class/uio/uioX/maps/mapX

/sys/class/uio/uioX/portio/portX

/dev/uioX -> mmap (offset)

/sys/bus/pci/devices

Host memory NIC memory

DMA RX

Update RDT

DMA descriptor(s)

RX queue RX FIFO

DMA packet

Descriptor ringMemory

DMA descriptors

Host memory NIC memory

DMA TX

Update TDT

DMA descriptor(s)

TX queue TX FIFO

DMA packet

Descriptor ringMemory

DMA descriptors

Receive from SW side

DD DD DDDD

RDT

DDmbuf1

addrDD

mbuf2

addr

RDT

RDH = 1

RDT = 5

RDBA = 0

RDLEN = 6

mbuf1RDH

RDH

mbuf2

Transmit from SW side

DD DD DDDD

TDT

DDmbuf1

addrDD

mbuf2

addr

TDT

TDH = 1

TDT = 5

TDBA = 0

TDLEN = 6

mbuf1TDH

TDH

mbuf2

NUMA

CPU 0

Cores

Me

mo

ry

co

ntro

ller

I/O controller

Me

mo

ry

PCI-E PCI-E

CPU 1

Cores

Me

mo

ry

co

ntr

olle

r

I/O controller

Me

mo

ry

PCI-E PCI-E

QPI

Socket 0 Socket 1

RSS (Receive Side Scaling)

Hash

functionQueue 0 CPU N

...

Queue N

Incoming traffic Indirection

table

Flow director

Queue 0 CPU N

...

Queue N

Incoming trafficFilter table

Hash

function

Outgoing traffic

Drop Route

Virtualization - SR-IOV

NIC

VMM

VM1

VF driver

VM2

VF driver

PF driver

VF

Virtual bridge

VF PF

NIC

Slow path using bifurcated driver

Kernel DPDK

VF

Virtual bridge

PF Filter table

Slow path using TAP

User space

Kernel space

NIC

App DPDKRing

buffers

TAP device

RX/TX

queues

TCP/IP

stack

Slow path using KNI

User space

Kernel space

NIC

App DPDKRing

buffers

KNI device

RX/TX

queues

TCP/IP

stack

x86 HW

Application 1 - Traffic generator

User space

Streams generator

DUT

Traffic analyzer

x86 HW

Application 2 - Router

Kernel

User space

Routing table

Routing table cacheDUT1 DUT2

x86 HW

Application 3 - Middlebox

User space

DPIDUT1 DUT2

My blog

Learning Network Programming