PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A...

45
2010 Sep. PacketShader: A GPU-Accelerated Software Router Sangjin Han In collaboration with: Keon Jang , KyoungSoo Park , Sue Moon Advanced Networking Lab, CS, KAIST Networked and Distributed Computing Systems Lab, EE, KAIST

Transcript of PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A...

Page 1: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

PacketShader:A GPU-Accelerated Software Router

Sangjin Han†

In collaboration with:

Keon Jang †, KyoungSoo Park‡, Sue Moon †

† Advanced Networking Lab, CS, KAIST‡ Networked and Distributed Computing Systems Lab, EE, KAIST

Page 2: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.2

PacketShader:A GPU-Accelerated Software Router

High-performance

Our prototype: 40 Gbps on a single box

Page 3: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Software Router

Despite its name, not limited to IP routing You can implement whatever you want on it.

Driven by software Flexible Friendly development environments

Based on commodity hardware Cheap Fast evolution

3

Page 4: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Now 10 Gigabit NIC is a commodity

From $200 – $300 per port Great opportunity for software routers

4

Page 5: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Achilles’ Heel of Software Routers

Low performance Due to CPU bottleneck

5

Year Ref. H/W IPv4 Throughput

2008 Egi et al. Two quad-core CPUs 3.5 Gbps

2008“Enhanced SR”Bolla et al.

Two quad-core CPUs 4.2 Gbps

2009“RouteBricks”Dobrescu et al.

Two quad-core CPUs(2.8 GHz)

8.7 Gbps

Not capable of supporting even a single 10G port

Page 6: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

CPU BOTTLENECK

6

Page 7: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Per-Packet CPU Cycles for 10G

7

1,200 600

1,200 1,600Cycles needed

Packet I/O IPv4 lookup

= 1,800 cycles

= 2,800

Yourbudget

1,400 cycles

10G, min-sized packets, dual quad-core 2.66GHz CPUs

5,4001,200 … = 6,600

Packet I/O IPv6 lookup

Packet I/O Encryption and hashing

IPv4

IPv6

IPsec

+

+

+

(in x86, cycle numbers are from RouteBricks [Dobrescu09] and ours)

Page 8: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Our Approach 1: I/O Optimization

8

Packet I/O

Packet I/O

Packet I/O

Packet I/O

1,200 reduced to 200 cycles per packet Main ideas

Huge packet buffer Batch processing

600

1,600

IPv4 lookup

= 1,800 cycles

= 2,800

5,400 … = 6,600

IPv6 lookup

Encryption and hashing

+

+

+

1,200

1,200

1,200

Page 9: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Our Approach 2: GPU Offloading

9

Packet I/O

Packet I/O

Packet I/O

GPU Offloading for Memory-intensive or Compute-intensive

operations

Main topic of this talk

600

1,600

IPv4 lookup

5,400 …

IPv6 lookup

Encryption and hashing

+

+

+

Page 10: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

WHAT IS GPU?

10

Page 11: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

GPU = Graphics Processing Unit

The heart of graphics cards Mainly used for real-time 3D game rendering Massively-parallel processing capacity

11

(Ubisoft’s AVARTAR, from http://ubi.com)

Page 12: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

CPU vs. GPU

12

CPU:Small # of super-fast cores

GPU:Large # of small cores

Page 13: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

“Silicon Budget” in CPU and GPU

13

Xeon X5550:

4 cores

731M transistorsGTX480:

480 cores

3,200M transistors

ALU

Page 14: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

GPU FOR PACKET PROCESSING

14

Page 15: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Advantages of GPU for Packet Processing

1. Raw computation power

2. Memory access latency

3. Memory bandwidth

Comparison between Intel X5550 CPU NVIDIA GTX480 GPU

15

Page 16: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

(1/3) Raw Computation Power

Compute-intensive operations in software routers Hashing, encryption, pattern matching, network coding,

compression, etc. GPU can help!

16

CPU: 43×109

= 2.66 (GHz) ×4 (# of cores) ×

4 (4-way superscalar)

GPU: 672×109

= 1.4 (GHz) ×480 (# of cores)

Instructions/sec

<

Page 17: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

(2/3) Memory Access Latency

Software router lots of cache misses GPU can effectively hide memory latency

17

GPU core

Cachemiss

Cachemiss

Switch to Thread 2

Switch to Thread 3

Page 18: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

(3/3) Memory Bandwidth

18

CPU’s memory bandwidth (theoretical): 32 GB/s

Page 19: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

(3/3) Memory Bandwidth

19

CPU’s memory bandwidth (empirical) < 25 GB/s

4. TX: RAM NIC

3. TX: CPU RAM2. RX:

RAM CPU

1. RX: NIC RAM

Page 20: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

(3/3) Memory Bandwidth

20

Your budget for packet processing can be less 10 GB/s

Page 21: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

(3/3) Memory Bandwidth

21

Your budget for packet processing can be less 10 GB/sGPU’s memory bandwidth: 174GB/s

Page 22: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

HOW TO USE GPU

22

Page 23: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Basic Idea

23

Offload core operations to GPU(e.g., forwarding table lookup)

Page 24: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Recap

24

GTX480: 480 cores

For GPU, more parallelism, more throughput

Page 25: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Parallelism in Packet Processing

The key insight Stateless packet processing = parallelizable

25

RX queue

1. Batching

2. Parallel Processingin GPU

Page 26: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Batching Long Latency?

Fast link = enough # of packets in a small time window

10 GbE link up to 1,000 packets only in 67μs

Much less time with 40 or 100 GbE

26

Page 27: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

PACKETSHADER DESIGN

27

Page 28: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Basic Design

Three stages in a streamline

28

Pre-shader Shader Post-

shader

Page 29: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Packet’s Journey (1/3)

IPv4 forwarding example

29

Pre-shader Shader Post-

shader

• Checksum, TTL• Format check• … Collected

dst. IP addrs

Some packetsgo to slow-path

Page 30: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Packet’s Journey (2/3)

IPv4 forwarding example

30

Pre-shader Shader Post-

shader

1. IP addresses

2. Forwarding table lookup

3. Next hops

Page 31: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Packet’s Journey (3/3)

IPv4 forwarding example

31

Pre-shader Shader Post-

shader

Update packets and transmit

Page 32: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Interfacing with NICs

32

Pre-shader Shader Post-

shaderDevice driver

Packet RX

Device driver

Packet TX

Page 33: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Device driver

Pre-shader

Shader

Post-shader

Device driver

Scaling with a Multi-Core CPU

Master core

Worker cores

33

Page 34: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.34

Device driver

Pre-shader

Shader

Post-shader

Device driver

Shader

Scaling with Multiple Multi-Core CPUs

Page 35: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

EVALUATION

35

Page 36: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Hardware Setup

36

CPU:

Quad-core, 2.66 GHz

GPU:

NIC: Total 80 Gbps

Dual-port 10 GbE

Total 8 CPU cores

480 cores, 1.4 GHz

Total 960 cores

Page 37: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Experimental Setup

37

8 × 10 GbE linksPacket generator PacketShader

(Up to 80 Gbps)

Input traffic

Processed packets

Page 38: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Results (w/ 64B packets)

38

28.2

8

15.6

3

39.2 38.232

10.2

05

10152025303540

IPv4 IPv6 OpenFlow IPsec

Thr

ough

put (

Gbp

s)

CPU-only CPU+GPU

1.4x 4.8x 2.1x 3.5xGPU speedup

Page 39: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Example 1: IPv6 forwarding

Longest prefix matching on 128-bit IPv6 addresses

Algorithm: binary search on hash tables [Waldvogel97] 7 hashings + 7 memory accesses

39

… … … …

Prefix length 1 64 1289680

Page 40: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Example 1: IPv6 forwarding

(Routing table was randomly generated with 200K entries)

40

05

1015202530354045

64 128 256 512 1024 1514

Thr

ough

put (

Gbp

s)

Packet size (bytes)

CPU-only CPU+GPU

Bounded by motherboardIO capacity

Page 41: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Example 2: IPsec tunneling

41

IP header IP payload

IP header IP payload ESPtrailer

ESP (Encapsulating Security Payload) Tunnel mode with AES-CTR (encryption) and SHA1 (authentication)

+

Original IP packet

IP header IP payload ESPtrailer

ESPheader

+

IP header IP payload ESPtrailer

ESPheader

ESPAuth.

New IPheader

+IPsec Packet

1. AES

2. SHA1

Page 42: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Example 2: IPsec tunneling

3.5x speedup

42

1

1.5

2

2.5

3

3.5

4

0

4

8

12

16

20

24

64 128 256 512 1024 1514

Spee

dup

Thr

ough

put (

Gbp

s)

Packet size (bytes)

CPU-only CPU+GPU Speedup

Page 43: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.43

Year Ref. H/W IPv4Throughput

2008 Egi et al. Two quad-core CPUs 3.5 Gbps

2008 “Enhanced SR”Bolla et al.

Two quad-core CPUs 4.2 Gbps

2009 “RouteBricks”Dobrescu et al.

Two quad-core CPUs(2.8 GHz)

8.7 Gbps

2010 PacketShader(CPU-only)

Two quad-core CPUs(2.66 GHz)

28.2 Gbps

2010 PacketShader(CPU+GPU)

Two quad-core CPUs+ two GPUs

39.2 Gbps

Kernel

User

Page 44: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Conclusions

GPU a great opportunity for fast packet processing

PacketShader Optimized packet I/O + GPU acceleration scalable with

• # of multi-core CPUs, GPUs, and high-speed NICs

Current Prototype Supports IPv4, IPv6, OpenFlow, and IPsec 40 Gbps performance on a single PC

44

Page 45: PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A GPU-Accelerated Software Router SangjinHan † In collaboration with: KeonJang

2010 Sep.

Future Work

Control plane integration Dynamic routing protocols with Quagga or Xorp

Multi-functional, modular programming environment Integration with Click? [Kohler99]

Opportunistic offloading CPU at low load GPU at high load

Stateful packet processing

45