PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A...
Transcript of PacketShader - SIGCOMMconferences.sigcomm.org/sigcomm/2010/slides/S6Han.pdf2010 Sep. PacketShader: A...
2010 Sep.
PacketShader:A GPU-Accelerated Software Router
Sangjin Han†
In collaboration with:
Keon Jang †, KyoungSoo Park‡, Sue Moon †
† Advanced Networking Lab, CS, KAIST‡ Networked and Distributed Computing Systems Lab, EE, KAIST
2010 Sep.2
PacketShader:A GPU-Accelerated Software Router
High-performance
Our prototype: 40 Gbps on a single box
2010 Sep.
Software Router
Despite its name, not limited to IP routing You can implement whatever you want on it.
Driven by software Flexible Friendly development environments
Based on commodity hardware Cheap Fast evolution
3
2010 Sep.
Now 10 Gigabit NIC is a commodity
From $200 – $300 per port Great opportunity for software routers
4
2010 Sep.
Achilles’ Heel of Software Routers
Low performance Due to CPU bottleneck
5
Year Ref. H/W IPv4 Throughput
2008 Egi et al. Two quad-core CPUs 3.5 Gbps
2008“Enhanced SR”Bolla et al.
Two quad-core CPUs 4.2 Gbps
2009“RouteBricks”Dobrescu et al.
Two quad-core CPUs(2.8 GHz)
8.7 Gbps
Not capable of supporting even a single 10G port
2010 Sep.
CPU BOTTLENECK
6
2010 Sep.
Per-Packet CPU Cycles for 10G
7
1,200 600
1,200 1,600Cycles needed
Packet I/O IPv4 lookup
= 1,800 cycles
= 2,800
Yourbudget
1,400 cycles
10G, min-sized packets, dual quad-core 2.66GHz CPUs
5,4001,200 … = 6,600
Packet I/O IPv6 lookup
Packet I/O Encryption and hashing
IPv4
IPv6
IPsec
+
+
+
(in x86, cycle numbers are from RouteBricks [Dobrescu09] and ours)
2010 Sep.
Our Approach 1: I/O Optimization
8
Packet I/O
Packet I/O
Packet I/O
Packet I/O
1,200 reduced to 200 cycles per packet Main ideas
Huge packet buffer Batch processing
600
1,600
IPv4 lookup
= 1,800 cycles
= 2,800
5,400 … = 6,600
IPv6 lookup
Encryption and hashing
+
+
+
1,200
1,200
1,200
2010 Sep.
Our Approach 2: GPU Offloading
9
Packet I/O
Packet I/O
Packet I/O
GPU Offloading for Memory-intensive or Compute-intensive
operations
Main topic of this talk
600
1,600
IPv4 lookup
5,400 …
IPv6 lookup
Encryption and hashing
+
+
+
2010 Sep.
WHAT IS GPU?
10
2010 Sep.
GPU = Graphics Processing Unit
The heart of graphics cards Mainly used for real-time 3D game rendering Massively-parallel processing capacity
11
(Ubisoft’s AVARTAR, from http://ubi.com)
2010 Sep.
CPU vs. GPU
12
CPU:Small # of super-fast cores
GPU:Large # of small cores
2010 Sep.
“Silicon Budget” in CPU and GPU
13
Xeon X5550:
4 cores
731M transistorsGTX480:
480 cores
3,200M transistors
ALU
2010 Sep.
GPU FOR PACKET PROCESSING
14
2010 Sep.
Advantages of GPU for Packet Processing
1. Raw computation power
2. Memory access latency
3. Memory bandwidth
Comparison between Intel X5550 CPU NVIDIA GTX480 GPU
15
2010 Sep.
(1/3) Raw Computation Power
Compute-intensive operations in software routers Hashing, encryption, pattern matching, network coding,
compression, etc. GPU can help!
16
CPU: 43×109
= 2.66 (GHz) ×4 (# of cores) ×
4 (4-way superscalar)
GPU: 672×109
= 1.4 (GHz) ×480 (# of cores)
Instructions/sec
<
2010 Sep.
(2/3) Memory Access Latency
Software router lots of cache misses GPU can effectively hide memory latency
17
GPU core
Cachemiss
Cachemiss
Switch to Thread 2
Switch to Thread 3
2010 Sep.
(3/3) Memory Bandwidth
18
CPU’s memory bandwidth (theoretical): 32 GB/s
2010 Sep.
(3/3) Memory Bandwidth
19
CPU’s memory bandwidth (empirical) < 25 GB/s
4. TX: RAM NIC
3. TX: CPU RAM2. RX:
RAM CPU
1. RX: NIC RAM
2010 Sep.
(3/3) Memory Bandwidth
20
Your budget for packet processing can be less 10 GB/s
2010 Sep.
(3/3) Memory Bandwidth
21
Your budget for packet processing can be less 10 GB/sGPU’s memory bandwidth: 174GB/s
2010 Sep.
HOW TO USE GPU
22
2010 Sep.
Basic Idea
23
Offload core operations to GPU(e.g., forwarding table lookup)
2010 Sep.
Recap
24
GTX480: 480 cores
For GPU, more parallelism, more throughput
2010 Sep.
Parallelism in Packet Processing
The key insight Stateless packet processing = parallelizable
25
RX queue
1. Batching
2. Parallel Processingin GPU
2010 Sep.
Batching Long Latency?
Fast link = enough # of packets in a small time window
10 GbE link up to 1,000 packets only in 67μs
Much less time with 40 or 100 GbE
26
2010 Sep.
PACKETSHADER DESIGN
27
2010 Sep.
Basic Design
Three stages in a streamline
28
Pre-shader Shader Post-
shader
2010 Sep.
Packet’s Journey (1/3)
IPv4 forwarding example
29
Pre-shader Shader Post-
shader
• Checksum, TTL• Format check• … Collected
dst. IP addrs
Some packetsgo to slow-path
2010 Sep.
Packet’s Journey (2/3)
IPv4 forwarding example
30
Pre-shader Shader Post-
shader
1. IP addresses
2. Forwarding table lookup
3. Next hops
2010 Sep.
Packet’s Journey (3/3)
IPv4 forwarding example
31
Pre-shader Shader Post-
shader
Update packets and transmit
2010 Sep.
Interfacing with NICs
32
Pre-shader Shader Post-
shaderDevice driver
Packet RX
Device driver
Packet TX
2010 Sep.
Device driver
Pre-shader
Shader
Post-shader
Device driver
Scaling with a Multi-Core CPU
Master core
Worker cores
33
2010 Sep.34
Device driver
Pre-shader
Shader
Post-shader
Device driver
Shader
Scaling with Multiple Multi-Core CPUs
2010 Sep.
EVALUATION
35
2010 Sep.
Hardware Setup
36
CPU:
Quad-core, 2.66 GHz
GPU:
NIC: Total 80 Gbps
Dual-port 10 GbE
Total 8 CPU cores
480 cores, 1.4 GHz
Total 960 cores
2010 Sep.
Experimental Setup
37
…
8 × 10 GbE linksPacket generator PacketShader
(Up to 80 Gbps)
Input traffic
Processed packets
2010 Sep.
Results (w/ 64B packets)
38
28.2
8
15.6
3
39.2 38.232
10.2
05
10152025303540
IPv4 IPv6 OpenFlow IPsec
Thr
ough
put (
Gbp
s)
CPU-only CPU+GPU
1.4x 4.8x 2.1x 3.5xGPU speedup
2010 Sep.
Example 1: IPv6 forwarding
Longest prefix matching on 128-bit IPv6 addresses
Algorithm: binary search on hash tables [Waldvogel97] 7 hashings + 7 memory accesses
39
… … … …
Prefix length 1 64 1289680
2010 Sep.
Example 1: IPv6 forwarding
(Routing table was randomly generated with 200K entries)
40
05
1015202530354045
64 128 256 512 1024 1514
Thr
ough
put (
Gbp
s)
Packet size (bytes)
CPU-only CPU+GPU
Bounded by motherboardIO capacity
2010 Sep.
Example 2: IPsec tunneling
41
IP header IP payload
IP header IP payload ESPtrailer
ESP (Encapsulating Security Payload) Tunnel mode with AES-CTR (encryption) and SHA1 (authentication)
+
Original IP packet
IP header IP payload ESPtrailer
ESPheader
+
IP header IP payload ESPtrailer
ESPheader
ESPAuth.
New IPheader
+IPsec Packet
1. AES
2. SHA1
2010 Sep.
Example 2: IPsec tunneling
3.5x speedup
42
1
1.5
2
2.5
3
3.5
4
0
4
8
12
16
20
24
64 128 256 512 1024 1514
Spee
dup
Thr
ough
put (
Gbp
s)
Packet size (bytes)
CPU-only CPU+GPU Speedup
2010 Sep.43
Year Ref. H/W IPv4Throughput
2008 Egi et al. Two quad-core CPUs 3.5 Gbps
2008 “Enhanced SR”Bolla et al.
Two quad-core CPUs 4.2 Gbps
2009 “RouteBricks”Dobrescu et al.
Two quad-core CPUs(2.8 GHz)
8.7 Gbps
2010 PacketShader(CPU-only)
Two quad-core CPUs(2.66 GHz)
28.2 Gbps
2010 PacketShader(CPU+GPU)
Two quad-core CPUs+ two GPUs
39.2 Gbps
Kernel
User
2010 Sep.
Conclusions
GPU a great opportunity for fast packet processing
PacketShader Optimized packet I/O + GPU acceleration scalable with
• # of multi-core CPUs, GPUs, and high-speed NICs
Current Prototype Supports IPv4, IPv6, OpenFlow, and IPsec 40 Gbps performance on a single PC
44
2010 Sep.
Future Work
Control plane integration Dynamic routing protocols with Quagga or Xorp
Multi-functional, modular programming environment Integration with Click? [Kohler99]
Opportunistic offloading CPU at low load GPU at high load
Stateful packet processing
45