100 M pps on PC.

Handling 100M pps on a PC platform: achieving very high-speed IP packet processing on a commodity PC platform using modern kernel-bypass techniques

version 2016.02.29a

Who are we?

Who are we? (1)

Przemysław Frasunek

• Multimedia and Security Division Director

• Responsible for redCDN and redGuardian services

• IT security enthusiast – over 40 vulnerabilities reported on BUGTRAQ since 1999

Paweł Małachowski

• Leads redGuardian development team

• ISP/telco/UNIX background since 1996

• Experience as business analyst, system engineer, IT operations manager

Who are we? (2)

[Figure: Multimedia, Smart Grid, Cybersecurity; Phoenix-RTOS, Phoenix-PRIME, Hermes]

redCDN, WTF? (1)

• redCDN – the largest Polish CDN operated by Atende Software

• CDN nodes collocated in major Polish IXPs and ISPs

• Over 400 Gbps of network capacity

• Fully based on in-house developed software since 2006

• Supports most important multimedia protocols (Smooth Streaming, MPEG-DASH, HLS) and simple HTTP/HTTPS

redCDN, WTF? (2)

redCDN, WTF? (3)

http://antyweb.pl/odwiedzilismy-atende-software-to-dzieki-nim-mozecie-ogladac-iple-i-player-pl/

100M pps… Wait But Why?

redGuardian – brief history – 2014 Q3

• The initial idea: hey, our CDN network carries mostly outgoing traffic, so let’s do something new to fill it with incoming traffic

• Maybe a DDoS protection service offered in a scrubbing center model?

• Let’s test DDoS-mitigation appliances available on the market

• Conclusions:

– Radware, Huawei: too expensive, not suited for multitenancy

– Arbor: they didn’t allow us to test their solution

• Let’s check whether commodity hardware can forward/filter large amounts of traffic

• Our goal: at least 20 Gbit/s (29.76 Mpps @ 64B packets) on a single Intel-based server

• Step 1: Vanilla Linux

– Intel Xeon E3-1200 (single-socket, quad-core, non-hyperthreaded)

– 2x Intel 10 GbE NIC based on Intel's X540-AT2 Ethernet controller

– 16 GB RAM (4x4 GB DDR3 1.3GHz)

– Ubuntu 14.04.1 LTS for x86-64 architecture (kernel 3.13.0-40-generic)

redGuardian – brief history – 2014 Q4 (test results)

• Conclusion: general-purpose OSes with default network stacks cannot handle multiple 10 GbE interfaces saturated with minimum-size frames

• Step 2: evaluation of data-plane architectures

– Intel DPDK (http://dpdk.org)

– A set of libraries for fast packet processing (BSD license)

– Handles packets within a minimal number of CPU cycles

– …but provides only a very basic set of functions (memory management, ring buffers, poll-mode drivers)

– Almost the entire IP stack needs to be implemented on your own

• Step 3: simple DPDK-based L3 forwarder

– Based on example code in DPDK source tree

– No ARP, single next-hop, no ACLs

– 3 CPU cores isolated from the Linux scheduler and IRQ balancer and dedicated to DPDK

– Simultaneous RX and TX of 14.8M pps @ 64B

– Hell yeah!

• Step 4: simple forwarder with ACLs

– Simple L3 ACLs (IP/mask)

– 100k random entries

– 10GbE wire speed on a single CPU core

• Step 5: ask Paweł to join our team and lead development of our own DDoS-mitigation solution

– In the first phase, we decided to focus on mitigation of volumetric attacks (especially DNS and NTP reflection)

– In the next phases, we would like to inspect and inject traffic into HTTP sessions

– We needed to implement a filter module and web panel

– We wanted to handle over 100 Gbps on a single PC

– Our filtering nodes should be installed in all major CDN sites

– Hardcore development started in March 2015

100M pps… what is the problem?

Challenge 100Mpps – explained

PHY speed | 60B* frames [pps] | 1514B frames [pps]
1Gbps | ~1.48M** (1 488 095) | ~81k (81 274)
10Gbps | ~14.88M (14 880 952) | ~812k (812 743)
40Gbps | ~59.52M | ~3.25M
100Gbps | ~148.8M | ~8.127M

So 100M pps FDX requires 7x10Gbps ports… easy!

* 12B inter-frame gap, 8B preamble, 60B payload, 4B CRC, no 802.1Q tag

** 1.86M with some cheating
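The packet rates in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming standard Ethernet framing overhead (8B preamble, 12B inter-frame gap, 4B CRC); the function and constant names are made up for illustration.

```python
import math

# Bytes that occupy the wire for every frame but are not part of the payload.
PREAMBLE_SFD = 8      # preamble + start-of-frame delimiter
INTERFRAME_GAP = 12   # minimum idle time between frames, in byte times
CRC = 4               # frame check sequence


def max_pps(link_bps: float, frame_bytes_without_crc: int) -> float:
    """Theoretical maximum packets per second for one direction of a link."""
    wire_bytes = frame_bytes_without_crc + CRC + PREAMBLE_SFD + INTERFRAME_GAP
    return link_bps / (wire_bytes * 8)


def ports_needed(target_pps: float, link_bps: float, frame_bytes_without_crc: int) -> int:
    """Number of ports of a given speed needed to carry target_pps."""
    return math.ceil(target_pps / max_pps(link_bps, frame_bytes_without_crc))


if __name__ == "__main__":
    for gbps in (1, 10, 40, 100):
        small = max_pps(gbps * 1e9, 60)
        large = max_pps(gbps * 1e9, 1514)
        print(f"{gbps:>3} Gbps: {small:>13,.0f} pps @ 60B, {large:>10,.0f} pps @ 1514B")
    print("10G ports for 100M pps:", ports_needed(100e6, 10e9, 60))
```

For 1Gbps this gives 1 488 095 pps with 60B frames and 81 274 pps with 1514B frames, and 7 ports of 10Gbps for 100M pps, matching the slide.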

Challenge 100Mpps – CPU cycles, budget estimation

For 10Gbps we have:

• ~1230 ns per full-sized frame

• 67.2 ns per small frame, which is roughly 200 cycles/frame on a modern 3GHz CPU

For 40Gbps: 16.8 ns.

For 100Gbps: 6.7 ns.
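The per-frame time budget follows directly from the number of bytes a frame occupies on the wire. A minimal sketch, assuming 84B on the wire for the smallest frame (60B + 4B CRC + 8B preamble + 12B gap) and 1538B for a full-sized one; the helper names are illustrative only.

```python
SMALL_FRAME_WIRE_BYTES = 84     # 60 + 4 (CRC) + 8 (preamble) + 12 (gap)
FULL_FRAME_WIRE_BYTES = 1538    # 1514 + 4 + 8 + 12


def frame_time_ns(link_gbps: float, wire_bytes: int) -> float:
    """Time one frame occupies the link, in nanoseconds."""
    # bits / (Gbit/s) gives nanoseconds directly.
    return wire_bytes * 8 / link_gbps


def cycles_per_frame(link_gbps: float, wire_bytes: int, cpu_ghz: float) -> float:
    """CPU cycles available per frame if one core must keep up with the link."""
    return frame_time_ns(link_gbps, wire_bytes) * cpu_ghz


if __name__ == "__main__":
    for gbps in (10, 40, 100):
        t = frame_time_ns(gbps, SMALL_FRAME_WIRE_BYTES)
        c = cycles_per_frame(gbps, SMALL_FRAME_WIRE_BYTES, 3.0)
        print(f"{gbps:>3} Gbps: {t:6.2f} ns per small frame, ~{c:.0f} cycles @ 3 GHz")
```

Spreading the traffic over several cores multiplies the per-core budget accordingly, which is why the multiqueue NIC features discussed later matter.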

Operation | Time cost*
register | <1 ns (~1 cycle)
L1 cache | ~1 ns (~3 cycles)
L2 cache | ~4 ns
L3 cache | ~8-12 ns
atomic lock+unlock | 16 ns
cache miss / RAM access | ~32-65 ns
syscall (beware of SELinux) | 50–100 ns

sources:

• “Network stack challenges at increasing speeds. The 100Gbit/s challenge”, RedHat 2015

• “HOW TO TEST 10 GIGABIT ETHERNET PERFORMANCE”, Spirent Whitepaper, 2012

• “The 7 Deadly Sins... of Packet Processing” from DPDK Summit Userspace, Oct 2015

• http://mechanical-sympathy.blogspot.co.uk/2013/02/cpu-cache-flushing-fallacy.html

* Note, these costs may vary between different CPU types, memories etc.
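To see why these costs matter, one can count how many operations of each kind fit into the 67.2 ns available per small frame at 10Gbps. This is a rough sketch that takes the upper end of each range from the table above; real costs vary by CPU and memory, as the note says.

```python
BUDGET_NS = 67.2  # time per smallest frame at 10 Gbps

# Upper-end costs taken from the table above (nanoseconds).
WORST_CASE_COST_NS = {
    "L2 cache hit": 4,
    "L3 cache hit": 12,
    "atomic lock+unlock": 16,
    "cache miss / RAM access": 65,
    "syscall": 100,
}


def ops_within(budget_ns: float, cost_ns: float) -> int:
    """How many operations of a given cost fit into the budget."""
    return int(budget_ns // cost_ns)


if __name__ == "__main__":
    for name, cost in WORST_CASE_COST_NS.items():
        print(f"{name:<24} {ops_within(BUDGET_NS, cost)} per frame")
```

With these figures a single cache miss consumes almost the whole budget and a single syscall exceeds it, which motivates the batching, prefetching and no-syscall rules that appear later in the talk.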

So what is the problem?

• OS network stacks were not designed with these speeds in mind; they were designed as control planes, not data planes.

• We have many CPU cores these days, but some OS network stacks do not scale.

sources:

• http://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf

• “Shmoocon 2013 - C10M Defending The Internet At Scale”: https://www.youtube.com/watch?v=73XNtI0w7jA

• http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html

Hardware – is it capable?

Crucial components:

1. Network Interfaces

2. PCIe bus

3. Memory

4. CPU

PCIe bus vs. NIC vs. memory

Interface | Raw unidirectional speed | Notes
Eth 10Gbps | ~1250 MB/s |
Eth 2x10Gbps | ~2500 MB/s | typically PCIe 2.0, 8x; but some NICs give ~80% speed with 64B frames FDX on both ports
Eth 40Gbps | ~5000 MB/s |
PCIe 2.0, 8x, 5GT/s | ~4000 MB/s | transport, ACK/transaction overhead; 8b/10b
PCIe 3.0, 4x | ~3940 MB/s | 128b/130b
PCIe 3.0, 8x, 8GT/s | <8000 MB/s |
PCIe 3.0, 16x, 8GT/s | ~15754 MB/s |
DDR3-1866, ~1.866GT/s | ~14933 MB/s | PC3-14900
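The raw figures in this table come from transfer rate, lane count and line encoding. The sketch below recomputes them and checks whether a NIC's ports fit into a slot; it ignores transaction-layer overhead (so real throughput is lower), and the function names are hypothetical.

```python
def pcie_mb_per_s(gt_per_s: float, lanes: int, payload_bits: int, total_bits: int) -> float:
    """Raw unidirectional PCIe bandwidth in MB/s after line encoding."""
    return gt_per_s * 1e9 * lanes * payload_bits / total_bits / 8 / 1e6


def ethernet_mb_per_s(gbps: float) -> float:
    """Raw unidirectional Ethernet bandwidth in MB/s."""
    return gbps * 1e9 / 8 / 1e6


def fits(port_gbps: float, ports: int, slot_mb_per_s: float) -> bool:
    """True if all ports at line rate fit into the slot's raw bandwidth."""
    return ports * ethernet_mb_per_s(port_gbps) <= slot_mb_per_s


if __name__ == "__main__":
    gen2_x8 = pcie_mb_per_s(5, 8, 8, 10)        # 8b/10b encoding
    gen3_x8 = pcie_mb_per_s(8, 8, 128, 130)     # 128b/130b encoding
    gen3_x16 = pcie_mb_per_s(8, 16, 128, 130)
    print(f"PCIe 2.0 x8:  {gen2_x8:8.0f} MB/s")
    print(f"PCIe 3.0 x8:  {gen3_x8:8.0f} MB/s")
    print(f"PCIe 3.0 x16: {gen3_x16:8.0f} MB/s")
    print("2x40G in PCIe 3.0 x8? ", fits(40, 2, gen3_x8))
    print("2x40G in PCIe 3.0 x16?", fits(40, 2, gen3_x16))
```

A dual-port 40Gbps card in a PCIe 3.0 x8 slot cannot run both ports at line rate, since 2 × 5000 MB/s exceeds the slot's raw bandwidth of just under 8000 MB/s.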

NICs – speeds and vendors

• 10 Gbps – mature, too slow

• 25Gbps – gaining popularity (http://www.2550100.com/, http://25gethernet.org/)

• 40Gbps – too much

• 2x40Gbps – cheating

• 100Gbps, 2x100Gbps – available! PCIe lanes can be split

• ensure that the PCIe bus width (after overhead) matches the port speeds!

• some cards have internal limits

• some motherboards share bus lanes between slots

Some of the vendors:

• Chelsio, Emulex, Intel, Mellanox, QLogic, Solarflare…

• FPGA-based: Accolade Technology, Invea-Tech, Liberouter COMBO, Myricom, Napatech, …

NICs – multiqueue and other features

• multiqueue

– e.g. 128 RX+TX pairs

– RX distribution by RSS or manual

– RSS hashing L3/L4 (ECMP like)

– Flow Director rules can be set with ethtool

• performance drop

• offloads: VLAN, checksum, encapsulation, etc.

• DCB and VM/VF features

• SR-IOV

• scatter & gather

• FPGA on board

• black magic
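RSS distribution can be illustrated with a small model: hash the L3/L4 tuple with a Toeplitz hash, then use the hash to pick an RX queue through an indirection table. This is a sketch only; the key below is arbitrary (not any vendor's default), and the queue count and table size are assumptions chosen for the example.

```python
import ipaddress
import struct

RSS_KEY = bytes(range(1, 41))                 # arbitrary 40-byte key, illustration only
NUM_QUEUES = 8                                # hypothetical number of RX queues
RETA = [i % NUM_QUEUES for i in range(128)]   # indirection table: hash bucket -> queue


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: XOR a sliding 32-bit key window for every set input bit."""
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 32:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for bit_index in range(len(data) * 8):
        if data[bit_index // 8] & (0x80 >> (bit_index % 8)):
            # Key window starting at bit_index, counted from the most significant bit.
            result ^= (key_int >> (key_bits - 32 - bit_index)) & 0xFFFFFFFF
    return result


def rx_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Pick the RX queue for an IPv4 TCP/UDP flow."""
    data = (ipaddress.IPv4Address(src_ip).packed
            + ipaddress.IPv4Address(dst_ip).packed
            + struct.pack("!HH", src_port, dst_port))
    return RETA[toeplitz_hash(RSS_KEY, data) % len(RETA)]


if __name__ == "__main__":
    print(rx_queue("1.2.3.4", "10.0.2.1", 1, 2222))
```

All packets of one flow land on the same queue (and thus the same core), which keeps per-flow state core-local. Note that with a generic key the two directions of a connection may hash to different queues.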

source: http://dpdk.org/doc/

NICs – examples

i40e

ixgbe

Modern Xeon CPUs – example

Architecture: x86_64

On-line CPU(s) list: 0-31

Thread(s) per core: 2

Core(s) per socket: 16

Model name: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz

CPU MHz: 1201.210

CPU max MHz: 3600.0000

CPU min MHz: 1200.0000

BogoMIPS: 4599.81

Virtualization: VT-x

L1d cache: 32K

L1i cache: 32K

L2 cache: 256K

L3 cache: 40960K

NUMA node0 CPU(s): 0-31

microarchitectures: Skylake (new, 2015Q4, E3 only for now), Broadwell (quite new), Haswell (this one)

Modern Xeon 1-2 socket CPUs

source: https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

• V3: Haswell family

• V4: Broadwell family

• No Skylake E5s yet

• AVX2: a must!

• AVX512: maybe in 2017…

• DDIO: recommended

• CAT: interesting!

• NUMA: avoid QPI overhead
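On Linux, some of these features can be checked from the CPU flags in /proc/cpuinfo. The sketch below parses that text for a few flag names (avx2, avx512f, and dca for Direct Cache Access); which flags are worth checking is an assumption for this example, and the parser simply returns an empty result when no flags line is found.

```python
WANTED_FLAGS = ("avx2", "avx512f", "dca")  # dca = Direct Cache Access


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def feature_report(cpuinfo_text: str) -> dict:
    """Map each wanted flag to True/False depending on whether the CPU reports it."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in WANTED_FLAGS}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or file not readable
    print(feature_report(text))
```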

Advanced Vector Extensions (AVX)

• With AVX2, we have 256-bit registers and instructions

• This lets a single instruction operate on multiple values “at once” (SIMD)

sources:

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

https://software.intel.com/en-us/node/513925

Intrinsics for Arithmetic Operations

Intrinsics for Arithmetic Shift Operations

Intrinsics for Blend Operations

Intrinsics for Bitwise Operations

Intrinsics for Broadcast Operations

Intrinsics for Compare Operations

Intrinsics for Fused Multiply Add Operations

Intrinsics for GATHER Operations

Intrinsics for Logical Shift Operations

Intrinsics for Insert/Extract Operations

Intrinsics for Masked Load/Store Operations

Intrinsics for Miscellaneous Operations

Intrinsics for Operations to Manipulate Integer Data at Bit-Granularity

Intrinsics for Pack/Unpack Operations

Intrinsics for Packed Move with Extend Operations

Intrinsics for Permute Operations

Intrinsics for Shuffle Operations

Intrinsics for Intel® Transactional Synchronization Extensions (Intel® TSX)









Advanced Vector Extensions (AVX) – example

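#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* ACL_USERDATA_DROP is defined elsewhere in the dataplane; it marks a "drop" verdict. */

/* Zero every verdict that has the drop bit set. */
static inline void clear_dropped_verdicts(uint32_t *vp, size_t n)
{
#ifdef __AVX2__
    static_assert(ACL_USERDATA_DROP == (1u << 31), "AVX2 code assumes ACL_USERDATA_DROP == 2^31");
    /* Works on whole 8-element vectors, so the buffer is presumably
       padded to a multiple of 8 entries. */
    for (;;) {
        /* The verdicts double as the store mask: only bit 31 of each lane is used. */
        __m256i dropmask = _mm256_loadu_si256((__m256i *)vp);
        /* Write zero only to the lanes whose drop bit is set. */
        _mm256_maskstore_epi32((int *)vp, dropmask, _mm256_setzero_si256());
        if (n <= 8)
            break;
        n -= 8;
        vp += 8;
    }
#else
    for (size_t i = 0; i < n; ++i)
        if (vp[i] & ACL_USERDATA_DROP)
            vp[i] = 0;
#endif
}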

code: redGuardian dataplane

explanation source: https://software.intel.com/en-us/node/513925

_mm256_loadu_si256: Loads integer values from the 256-bit unaligned memory location pointed to by *a, into a destination integer vector, which is returned by the intrinsic.

_mm256_maskstore_epi32: Conditionally stores 32-bit data elements from the source vector into the corresponding elements of the vector in memory referenced by addr. If an element of mask is 0, corresponding element of the result vector in memory stays unchanged. Only the most significant bit of each element in the vector mask is used.

_mm256_setzero_si256: Sets all the elements of an integer vector to zero and returns the integer vector.
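A scalar reference model is handy for cross-checking vectorized code like the function above. The sketch below models both paths in Python: the plain loop, and an 8-lane version in which the top bit of each loaded value acts as the store mask. It is an illustration of the semantics, not part of the redGuardian code, and unlike the C version it handles a short tail safely.

```python
ACL_USERDATA_DROP = 1 << 31  # same value the static_assert in the C code requires


def clear_dropped_verdicts(verdicts: list) -> None:
    """Scalar path: zero every verdict that has the drop bit set."""
    for i, value in enumerate(verdicts):
        if value & ACL_USERDATA_DROP:
            verdicts[i] = 0


def clear_dropped_verdicts_lanes(verdicts: list) -> None:
    """Model of the AVX2 path: 8 lanes at a time, masked store of zeros."""
    for base in range(0, len(verdicts), 8):
        lanes = verdicts[base:base + 8]          # models the unaligned 256-bit load
        for lane, value in enumerate(lanes):
            if (value >> 31) & 1:                # mask store looks only at the top bit
                verdicts[base + lane] = 0        # store zero into this lane only


if __name__ == "__main__":
    data = [0x80000001, 5, 0xFFFFFFFF, 0, 0x7FFFFFFF, 0x80000000, 1, 2, 0x90000000, 3]
    a, b = list(data), list(data)
    clear_dropped_verdicts(a)
    clear_dropped_verdicts_lanes(b)
    print(a == b, a)
```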




Intel Data Direct I/O (DDIO)

# cpuid | grep 'direct cache access' | head -1

direct cache access = true

source: Intel® Data Direct I/O Technology (Intel® DDIO): A Primer / Technical Brief

NIC pushes data directly into CPU L3 cache.

Thus, in some use cases, there are no memory accesses at all.

Very cool!

Poor man’s TCAM?

Intel Cache Allocation Technology (CAT)

sources:

https://github.com/01org/intel-cmt-cat

http://danluu.com/intel-cat/

Allows CPU L3 cache partitioning.

But why?

• Cache eviction problem

• Low priority tasks won’t thrash the cache of high priority tasks, e.g. control plane vs. data plane on the same CPU socket

• Useful also in virtualized environments (e.g. some VMs need low latencies)

Supported on: E5-2658 v3, E5-2648L v3, E5-2628L v3, E5-2618L v3, E5-2608L v3 and E5-2658A v3, E3-1258L v4 and E3-1278L v4












Crazy idea

• So HW is capable, OS is not

• NICs use DMA and we already have Userspace I/O

• Let’s bypass OS network stack and work with NICs directly!

source: 123rf.com

Dataplane frameworks*

framework | DPDK | Netmap | PF_RING ZC | Snabb Switch
OS | Linux, FreeBSD | FreeBSD, Linux | Linux | Linux
license | BSD | BSD | LGPL 2.1 + paid ZC driver | Apache 2.0
language | C | C | C | LuaJIT
bifurcated driver | - | + | + | -
support & community | Intel, 6Wind and some Well Known Vendors | FreeBSD, EU funding | ntop | Snabb Co.
sample usecase | appliances, NFV | NFV, routing acceleration | packet interception, IDS/IPS | NFV
notes | extremely optimized, huge library of components, WIP: ARM, Power8 support | available OOTB, ongoing pfSense integration | simple examples included, opensource IDS drivers | less mature, innovative, DSL, Lua GC

* skipped: PacketShader I/O Engine, PFQ, ef_vi, OpenDataPlane API

DPDK simplified overview (1)

• DPDK comes as a complete set of modules

• everything is built around the EAL (Environment Abstraction Layer)

• physical NICs are accessed via Poll Mode Drivers (requires UIO)

– Some PMDs are not mature enough

• VM drivers are available as well

• packets are exchanged via rings

• libraries for hashing, route lookups, ACLs, QoS, encryption etc. are provided
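The idea of exchanging packets via rings can be sketched as a fixed-size circular buffer with burst enqueue and dequeue. This is a simplified single-producer/single-consumer model for illustration, not DPDK's actual ring implementation; the class and method names are made up.

```python
class Ring:
    """Fixed-size circular buffer; the size must be a power of two."""

    def __init__(self, size: int):
        if size < 2 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self._slots = [None] * size
        self._mask = size - 1
        self._head = 0   # next slot to write (producer side)
        self._tail = 0   # next slot to read (consumer side)

    def enqueue_burst(self, items: list) -> int:
        """Enqueue as many items as fit; return how many were accepted."""
        free = len(self._slots) - (self._head - self._tail)
        n = min(free, len(items))
        for i in range(n):
            # Masking replaces a modulo because the size is a power of two.
            self._slots[(self._head + i) & self._mask] = items[i]
        self._head += n
        return n

    def dequeue_burst(self, max_items: int) -> list:
        """Dequeue up to max_items in FIFO order."""
        n = min(self._head - self._tail, max_items)
        out = [self._slots[(self._tail + i) & self._mask] for i in range(n)]
        self._tail += n
        return out


if __name__ == "__main__":
    ring = Ring(4)
    print(ring.enqueue_burst(["p1", "p2", "p3", "p4", "p5"]))  # only 4 fit
    print(ring.dequeue_burst(2))
```

Because the ring is preallocated and each side touches only its own index, a producer core and a consumer core can pass packet descriptors without allocation or locks; a real implementation additionally needs memory barriers.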

source: http://dpdk.readthedocs.org/en/latest/prog_guide

DPDK simplified overview (2) – components

RTE part | Description | What for?
ACL | access-lists | packet matching
LPM | DIR-24-8 | routing lookups
*hash | calculate hashes based on packet headers | state, ARP, flow lookups, etc.
crypto | crypto devices | IPsec VPNs acceleration
ring | circular packet buffers | HW/SW packet exchange
QoS | metering, scheduling, RED | QoS
packet framework | pipelines, table lookups | complex packet flow, OpenFlow-like
… | memory, locking, power, timing, etc. |

What can we build with these tools?

• switch

• router

• stateless and stateful firewall

• IDS/IPS

• load balancer

• userland UDP stack

• userland TCP stack

• traffic recorder

• fast internet scanners

• stateless packet generator

• stateful, application-like flow generator

• IPsec VPN gateway

• tunnel broker

• accelerated key-value DB

• accelerated NAS (and there is also SPDK)

• …

Some insights

How to code?

• packet batching

• vectorization (AVX)

• memory preallocation and hugepages

• memory channel awareness

• data prefetching

• cache-friendly data structures, cache aligning

• no data copying

• no syscalls

• no locking, almost no atomics, compare and swap etc.

• polling (but not too often), no interrupts

• one thread per core

• branch prediction hints

• function inlining

• NUMA locality

• time measurement – RDTSC is not free

some of the sources:

• http://dpdk.readthedocs.org/en/latest/prog_guide/writing_efficient_code.html

• https://dpdksummit.com/Archive/pdf/DPDK-Dublin2015-SevenDeadlySinsPacketProcessing.pdf

• http://www.net.in.tum.de/fileadmin/bibtex/publications/theses/2014-gallenmueller-high-speed-packet-processing.pdf

• https://lwn.net/Articles/629155/
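Several of the rules above (batching, polling, cheap time measurement) show up together in the shape of the main loop. The sketch below is one iteration of a run-to-completion loop that receives a burst, reads the clock once for the whole burst, and transmits the survivors. The callback names and burst size are assumptions; Python is used only to show the structure, not the speed.

```python
import time

BURST_SIZE = 32  # hypothetical burst size


def poll_once(rx_burst, handle, tx_burst, stats: dict) -> int:
    """One polling iteration; returns the number of packets received."""
    packets = rx_burst(BURST_SIZE)
    if not packets:
        stats["idle_polls"] += 1
        return 0
    now_ns = time.monotonic_ns()        # one clock read per burst, not per packet
    forwarded = []
    for pkt in packets:
        result = handle(pkt, now_ns)    # returns the packet, or None to drop it
        if result is not None:
            forwarded.append(result)
    tx_burst(forwarded)                 # one TX call per burst
    stats["rx"] += len(packets)
    stats["tx"] += len(forwarded)
    return len(packets)
```

The per-call overhead of RX, TX and the clock read is paid once per burst, so its cost per packet shrinks as the burst fills up.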

Multiple cores scaling vs. traffic policing and counters

• remove shared variables, get rid of locking

• maintain separate dataset per core

• test with all cores available

• how to synchronize ratelimiters and borrow bandwidth between cores?
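The "separate dataset per core" rule can be sketched with counters: each core increments only its own copy, and a reader sums the copies when a total is needed. This is a minimal illustration with hypothetical names; it does not answer the open question about borrowing rate-limiter bandwidth between cores.

```python
class PerCoreCounters:
    """One private counter set per core; totals are computed on read."""

    def __init__(self, n_cores: int, names: list):
        # Each core gets its own dict, so the hot path never shares a variable.
        self._per_core = [dict.fromkeys(names, 0) for _ in range(n_cores)]

    def add(self, core: int, name: str, value: int = 1) -> None:
        """Called on the data path by the owning core only."""
        self._per_core[core][name] += value

    def total(self, name: str) -> int:
        """Called by the control plane; sums all per-core values."""
        return sum(counters[name] for counters in self._per_core)


if __name__ == "__main__":
    counters = PerCoreCounters(4, ["rx", "dropped"])
    counters.add(0, "rx", 10)
    counters.add(3, "rx", 5)
    counters.add(1, "dropped")
    print(counters.total("rx"), counters.total("dropped"))
```

In C the per-core copies would also be padded to cache-line size so that two cores never write to the same cache line.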

Automated regression tests are a must

• performance

• features

• local (pcap) and real NICs

• different drivers

$ make run-tests
[...]
ACL with drop rule: drops ... #03 passed
ACL with no rules (empty): drops ... #04 passed
[...]
Not supported protocols are dropped ... #19 passed
Packets with TTL<=1 are dropped ... #20 passed
[...]
MTU-sized IP packets are forwarded ... #25 passed
IP len > reported frame len: dropped ... #26 passed
IP len < reported frame len: truncated ... #27 passed
[...]

-----------------------------------
Perf tests on ixgbe:
-----------------------------------
acl_limit RX/TX: 7139 / 9995 (min_rx 7134; max_rx 7143; dev_rx 2.5; dev_tx 1.8)
acl_pass RX/TX: 7149 / 9996 (min_rx 6086; max_rx 7326; dev_rx 367.5; dev_tx 1.1)
trivial RX/TX: 7862 / 10000 (min_rx 7658; max_rx 7903; dev_rx 68.3; dev_tx 0.7)
long_acl RX/TX: 5502 / 9996 (min_rx 5498; max_rx 5506; dev_rx 2.0; dev_tx 0.3)
-----------------------------------

Packet crafting with Scapy

>>> p=Ether()/IP()/ICMP()
>>> p.show()
###[ Ethernet ]###
  dst= ff:ff:ff:ff:ff:ff
  src= 00:00:00:00:00:00
  type= 0x800
###[ IP ]###
     version= 4
     ihl= None
     tos= 0x0
     len= None
     id= 1
     flags=
     frag= 0
     ttl= 64
     proto= icmp
     chksum= None
     src= 127.0.0.1
     dst= 127.0.0.1
     \options\
###[ ICMP ]###
        type= echo-request
        code= 0
        chksum= None
        id= 0x0
        seq= 0x0

def test_tcp_flags(self):
    # pass syn,!ack
    pkt1 = evalP(RAND_ETH / IP(src="1.2.3.4", dst="10.0.2.1") / TCP(sport=1, dport=2222, flags='S'))
    pkt2 = evalP(RAND_ETH / IP(src="1.2.3.4", dst="10.0.2.1") / TCP(sport=1, dport=2222, flags='SA'))
    pkt3 = evalP(RAND_ETH / IP(src="1.2.3.4", dst="10.0.2.1") / TCP(sport=1, dport=2222, flags='A'))
    pkt4 = evalP(RAND_ETH / IP(src="1.2.3.4", dst="10.0.2.1") / TCP(sport=1, dport=2222, flags='SU'))
    pkt5 = evalP(RAND_ETH / IP(src="1.2.3.4", dst="10.0.2.1") / TCP(sport=1, dport=2222, flags='U'))
    # pass ns,!ack
    # NS flag is represented by least significant bit in reserved area
    pkt6 = evalP(RAND_ETH / IP(src="1.2.3.4", dst="10.0.2.1") / TCP(sport=1, dport=3333, flags='', reserved=1))
    pkt7 = evalP(RAND_ETH / IP(src="1.2.3.4", dst="10.0.2.1") / TCP(sport=1, dport=3333, flags='A', reserved=1))
    out = self.__test_forward(pkt1 + pkt2 + pkt3 + pkt4 + pkt5 + pkt6 + pkt7)
    pkt_eq_(pkt1 + pkt4 + pkt6, out)

Gotcha!

AssertionError: len(expected) != len(output) (9 != 8) first difference at 8: 'Ethernet/IP/TCP/Raw' != '<missing>'
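A comparison helper that reports the first differing packet makes failures like the one above easy to read. The sketch below is a hypothetical stand-in for such a helper, working on lists of packet summary strings rather than real Scapy packets; it is not the pkt_eq_ used in the redGuardian test suite.

```python
def assert_packets_equal(expected: list, output: list) -> None:
    """Raise AssertionError describing the first difference between two packet lists."""
    if expected == output:
        return
    parts = []
    if len(expected) != len(output):
        parts.append(f"len(expected) != len(output) ({len(expected)} != {len(output)})")
    for i in range(max(len(expected), len(output))):
        # A list that ran out of packets is reported as '<missing>' at that position.
        exp = expected[i] if i < len(expected) else "<missing>"
        out = output[i] if i < len(output) else "<missing>"
        if exp != out:
            parts.append(f"first difference at {i}: {exp!r} != {out!r}")
            break
    raise AssertionError(" ".join(parts))


if __name__ == "__main__":
    expected = ["Ethernet/IP/TCP/Raw"] * 9
    try:
        assert_packets_equal(expected, expected[:8])
    except AssertionError as exc:
        print("AssertionError:", exc)
```

Reporting both the length mismatch and the index of the first difference tells at a glance whether a packet was dropped, duplicated or modified.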

Performance testing – granularity (1)

• RX performance snapshots, 1ms resolution

• average performance seems OK

• what if we look closer?

• WTF?

• isolate cores, avoid cache thrashing

Performance testing – granularity (2)

• very nice

• but WTF?

• thermal interactions

• modern CPUs scale their clock and it is not always possible to control this
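The point about granularity can be shown with a few numbers: an average over many fine-grained samples may look healthy while individual samples dip badly. The sketch below summarizes a series of per-interval RX rates and lists the intervals that fall well below the mean; the threshold and sample values are made up for illustration.

```python
import statistics


def summarize(samples: list) -> dict:
    """Mean, min, max and standard deviation of per-interval RX rates."""
    return {
        "mean": statistics.fmean(samples),
        "min": min(samples),
        "max": max(samples),
        "dev": statistics.pstdev(samples),
    }


def find_dips(samples: list, threshold_ratio: float = 0.9) -> list:
    """Indices of samples that fall below threshold_ratio * mean."""
    limit = threshold_ratio * statistics.fmean(samples)
    return [i for i, value in enumerate(samples) if value < limit]


if __name__ == "__main__":
    rx_rates = [100] * 9 + [10]   # made-up 1 ms snapshots with one short stall
    print(summarize(rx_rates))
    print("dips at:", find_dips(rx_rates))
```

Tracking the minimum and the deviation alongside the mean is what makes short stalls visible in the first place.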

Profiling

• Workload is repeatable

• Sampling profilers are great for finding hot spots

• Simple changes can have huge performance impact

Summary

• PC may be faster than you think

• In-depth understanding is required to develop fast, working solutions

• Commercial dataplane solutions are already here, virtual and physical

• “Worse Is Better”

Thank you for your attention!

BTW, we are hiring!
