PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A...

2010 Sep.

PacketShader:A GPU-Accelerated Software Router

Sangjin Han†

In collaboration with:

Keon Jang †, KyoungSoo Park‡, Sue Moon †

† Advanced Networking Lab, CS, KAIST‡ Networked and Distributed Computing Systems Lab, EE, KAIST

2010 Sep.2

PacketShader:A GPU-Accelerated Software Router

High-performance

Our prototype: 40 Gbps on a single box

2010 Sep.

Software Router

Despite its name, not limited to IP routing You can implement whatever you want on it.

Driven by software Flexible Friendly development environments

Based on commodity hardware Cheap Fast evolution

3

2010 Sep.

Now 10 Gigabit NIC is a commodity

From $200 – $300 per port Great opportunity for software routers

4

2010 Sep.

Achilles’ Heel of Software Routers

Low performance Due to CPU bottleneck

5

Year Ref. H/W IPv4 Throughput

2008 Egi et al. Two quad-core CPUs 3.5 Gbps

2008“Enhanced SR”Bolla et al.

Two quad-core CPUs 4.2 Gbps

2009“RouteBricks”Dobrescu et al.

Two quad-core CPUs(2.8 GHz)

8.7 Gbps

Not capable of supporting even a single 10G port

2010 Sep.

CPU BOTTLENECK

6

2010 Sep.

Per-Packet CPU Cycles for 10G

7

1,200 600

1,200 1,600Cycles needed

Packet I/O IPv4 lookup

= 1,800 cycles

= 2,800

Yourbudget

1,400 cycles

10G, min-sized packets, dual quad-core 2.66GHz CPUs

5,4001,200 … = 6,600

Packet I/O IPv6 lookup

Packet I/O Encryption and hashing

IPv4

IPv6

IPsec

+

+

+

(in x86, cycle numbers are from RouteBricks [Dobrescu09] and ours)

2010 Sep.

Our Approach 1: I/O Optimization

8

Packet I/O

Packet I/O

Packet I/O

Packet I/O

1,200 reduced to 200 cycles per packet Main ideas

Huge packet buffer Batch processing

600

1,600

IPv4 lookup

= 1,800 cycles

= 2,800

5,400 … = 6,600

IPv6 lookup

Encryption and hashing

+

+

+

1,200

1,200

1,200

2010 Sep.

Our Approach 2: GPU Offloading

9

Packet I/O

Packet I/O

Packet I/O

GPU Offloading for Memory-intensive or Compute-intensive

operations

Main topic of this talk

600

1,600

IPv4 lookup

5,400 …

IPv6 lookup

Encryption and hashing

+

+

+

2010 Sep.

WHAT IS GPU?

10

2010 Sep.

GPU = Graphics Processing Unit

The heart of graphics cards Mainly used for real-time 3D game rendering Massively-parallel processing capacity

11

(Ubisoft’s AVARTAR, from http://ubi.com)

2010 Sep.

CPU vs. GPU

12

CPU:Small # of super-fast cores

GPU:Large # of small cores

2010 Sep.

“Silicon Budget” in CPU and GPU

13

Xeon X5550:

4 cores

731M transistorsGTX480:

480 cores

3,200M transistors

ALU

2010 Sep.

GPU FOR PACKET PROCESSING

14

2010 Sep.

Advantages of GPU for Packet Processing

1. Raw computation power

2. Memory access latency

3. Memory bandwidth

Comparison between Intel X5550 CPU NVIDIA GTX480 GPU

15

2010 Sep.

(1/3) Raw Computation Power

Compute-intensive operations in software routers Hashing, encryption, pattern matching, network coding,

compression, etc. GPU can help!

16

CPU: 43×109

= 2.66 (GHz) ×4 (# of cores) ×

4 (4-way superscalar)

GPU: 672×109

= 1.4 (GHz) ×480 (# of cores)

Instructions/sec

<

2010 Sep.

(2/3) Memory Access Latency

Software router lots of cache misses GPU can effectively hide memory latency

17

GPU core

Cachemiss

Cachemiss

Switch to Thread 2

Switch to Thread 3

2010 Sep.

(3/3) Memory Bandwidth

18

CPU’s memory bandwidth (theoretical): 32 GB/s

2010 Sep.


19

CPU’s memory bandwidth (empirical) < 25 GB/s

4. TX: RAM NIC

3. TX: CPU RAM2. RX:

RAM CPU

1. RX: NIC RAM

2010 Sep.


20

Your budget for packet processing can be less 10 GB/s

2010 Sep.


21

Your budget for packet processing can be less 10 GB/sGPU’s memory bandwidth: 174GB/s

2010 Sep.

HOW TO USE GPU

22

2010 Sep.

Basic Idea

23

Offload core operations to GPU(e.g., forwarding table lookup)

2010 Sep.

Recap

24

GTX480: 480 cores

For GPU, more parallelism, more throughput

2010 Sep.

Parallelism in Packet Processing

The key insight Stateless packet processing = parallelizable

25

RX queue

1. Batching

2. Parallel Processingin GPU

2010 Sep.

Batching Long Latency?

Fast link = enough # of packets in a small time window

10 GbE link up to 1,000 packets only in 67μs

Much less time with 40 or 100 GbE

26

2010 Sep.

PACKETSHADER DESIGN

27

2010 Sep.

Basic Design

Three stages in a streamline

28

Pre-shader Shader Post-

shader

2010 Sep.

Packet’s Journey (1/3)

IPv4 forwarding example

29


shader

• Checksum, TTL• Format check• … Collected

dst. IP addrs

Some packetsgo to slow-path

2010 Sep.



30


shader

1. IP addresses

2. Forwarding table lookup

3. Next hops

2010 Sep.



31


shader

Update packets and transmit

2010 Sep.

Interfacing with NICs

32


shaderDevice driver

Packet RX

Device driver

Packet TX

2010 Sep.

Device driver

Pre-shader

Shader

Post-shader

Device driver

Scaling with a Multi-Core CPU

Master core

Worker cores

33

2010 Sep.34

Device driver

Pre-shader

Shader

Post-shader

Device driver

Shader

Scaling with Multiple Multi-Core CPUs

2010 Sep.

EVALUATION

35

2010 Sep.

Hardware Setup

36

CPU:

Quad-core, 2.66 GHz

GPU:

NIC: Total 80 Gbps

Dual-port 10 GbE

Total 8 CPU cores

480 cores, 1.4 GHz

Total 960 cores

2010 Sep.

Experimental Setup

37

…

8 × 10 GbE linksPacket generator PacketShader

(Up to 80 Gbps)

Input traffic

Processed packets

2010 Sep.

Results (w/ 64B packets)

38

28.2

8

15.6

3

39.2 38.232

10.2

05

10152025303540

IPv4 IPv6 OpenFlow IPsec

Thr

ough

put (

Gbp

s)

CPU-only CPU+GPU

1.4x 4.8x 2.1x 3.5xGPU speedup

2010 Sep.

Example 1: IPv6 forwarding

Longest prefix matching on 128-bit IPv6 addresses

Algorithm: binary search on hash tables [Waldvogel97] 7 hashings + 7 memory accesses

39

… … … …

Prefix length 1 64 1289680

2010 Sep.

Example 1: IPv6 forwarding

(Routing table was randomly generated with 200K entries)

40

05

1015202530354045

64 128 256 512 1024 1514

Thr

ough

put (

Gbp

s)

Packet size (bytes)

CPU-only CPU+GPU

Bounded by motherboardIO capacity

2010 Sep.

Example 2: IPsec tunneling

41

IP header IP payload

IP header IP payload ESPtrailer

ESP (Encapsulating Security Payload) Tunnel mode with AES-CTR (encryption) and SHA1 (authentication)

+

Original IP packet


ESPheader

+


ESPheader

ESPAuth.

New IPheader

+IPsec Packet

1. AES

2. SHA1

2010 Sep.

Example 2: IPsec tunneling

3.5x speedup

42

1

1.5

2

2.5

3

3.5

4

0

4

8

12

16

20

24

64 128 256 512 1024 1514

Spee

dup

Thr

ough

put (

Gbp

s)

Packet size (bytes)

CPU-only CPU+GPU Speedup

2010 Sep.43

Year Ref. H/W IPv4Throughput

2008 Egi et al. Two quad-core CPUs 3.5 Gbps

2008 “Enhanced SR”Bolla et al.

Two quad-core CPUs 4.2 Gbps

2009 “RouteBricks”Dobrescu et al.


8.7 Gbps

2010 PacketShader(CPU-only)


28.2 Gbps

2010 PacketShader(CPU+GPU)

Two quad-core CPUs+ two GPUs

39.2 Gbps

Kernel

User

2010 Sep.

Conclusions

GPU a great opportunity for fast packet processing

PacketShader Optimized packet I/O + GPU acceleration scalable with

• # of multi-core CPUs, GPUs, and high-speed NICs

Current Prototype Supports IPv4, IPv6, OpenFlow, and IPsec 40 Gbps performance on a single PC

44

2010 Sep.

Future Work

Control plane integration Dynamic routing protocols with Quagga or Xorp

Multi-functional, modular programming environment Integration with Click? [Kohler99]

Opportunistic offloading CPU at low load GPU at high load

Stateful packet processing

45

PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A...

Documents

Transcript of PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A...