mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf ·...

22
mTCP : A Highly Scalable User - level TCP Stack for Multicore Systems EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong Sunghwan Ihm*, Dongsu Han, and KyoungSoo Park KAIST * Princeton University

Transcript of mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf ·...

Page 1: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

mTCP: A Highly Scalable User-level TCP

Stack for Multicore Systems

EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong

Sunghwan Ihm*, Dongsu Han, and KyoungSoo Park

KAIST * Princeton University

Page 2: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Needs for Handling Many Short Flows

2

End systems- Web servers

61%

91%

0%

20%

40%

60%

80%

100%

0 4K 16K 64K 256K 1M

CD

F

FLOW SIZE (BYTES)

Flow count

* Commercial cellular traffic for 7 daysComparison of Caching Strategies in Modern Cellular Backhaul Networks, MOBISYS 2013

32K

Middleboxes- SSL proxies- Network caches

Page 3: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Unsatisfactory Performance of Linux TCP

3

• Large flows: Easy to fill up 10 Gbps

• Small flows: Hard to fill up 10 Gbps regardless of # cores – Too many packets:

14.88 Mpps for 64B packets in a 10 Gbps link

– Kernel is not designed well for multicore systems

0.0

0.5

1.0

1.5

2.0

2.5

1 2 4 6 8

Co

nn

ect

ion

s/se

c (x

10

5)

Number of CPU Cores

TCP Connection Setup Performance

Linux: 3.10.16Intel Xeon E5-2690Intel 10Gbps NIC

Performance meltdown

Page 4: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Kernel Uses the Most CPU Cycles

4

83% of CPU usage spent inside kernel!

Performance bottlenecks 1. Shared resources2. Broken locality3. Per packet processing

1) Efficient use of CPU cycles for TCP/IP processing 2.35x more CPU cycles for app

2) 3x ~ 25x better performance

Bottleneck removed by mTCPKernel

(without TCP/IP)45%

Packet I/O4%

TCP/IP34%

Application17%

CPU Usage Breakdown of Web ServerWeb server (Lighttpd) Serving a 64 byte fileLinux-3.10

Page 5: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Inefficiencies in Kernel from Shared FD

1. Shared resources

– Shared listening queue

– Shared file descriptor space

5

Per-core packet queue

Receive-Side Scaling (H/W)

Core 0 Core 1 Core 3Core 2

Listening queue

Lock

File descriptor space

Linear search for finding empty slot

Page 6: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Inefficiencies in Kernel from Broken Locality

2. Broken locality

6

Per-core packet queue

Receive-Side Scaling (H/W)

Core 0 Core 1 Core 3Core 2

Interrupt handle

accept()read()write()

Interrupt handling core != accepting core

Page 7: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Inefficiencies in Kernel from Lack of Support for Batching

3. Per packet, per system call processing

Inefficient per packet processing

Frequent mode switchingCache pollution

Per packet memory allocation

Inefficient per system call processing

7

accept(), read(), write()

Packet I/O

Kernel TCP

Application thread

BSD socket LInux epollKernel

User

Page 8: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Previous Works on Solving Kernel Complexity

8

Listening queue

Connection locality

App <-> TCP comm.

Packet I/O API

Linux-2.6 Shared No Per system call Per packet BSD

Linux-3.9SO_REUSEPORT

Per-core No Per system call Per packet BSD

Affinity-Accept Per-core Yes Per system call Per packet BSD

MegaPipe Per-core YesBatchedsystem call

Per packet custom

How much performance improvement can we getif we implement a user-level TCP stack with all optimizations?

Still, 78% of CPU cycles are used in kernel!

Page 9: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Clean-slate Design Principles of mTCP

9

• mTCP: A high-performance user-level TCP designed for multicore systems

• Clean-slate approach to divorce kernel’s complexity

Each core works independently

– No shared resources

– Resources affinity

1. Shared resources

2. Broken locality

Problems Our contributions

Batching from flow processing from packet I/O to user API

3. Lack of support for batching

Easily portable APIs for compatibility

Page 10: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Overview of mTCP Architecture

10

1. Thread model: Pairwise, per-core threading

2. Batching from packet I/O to application

3. mTCP API: Easily portable API (BSD-like)

User-level packet I/O library (PSIO)

mTCP thread 0 mTCP thread 1

ApplicationThread 0

ApplicationThread 1

mTCP socket mTCP epoll

NIC device driver Kernel-level

12

3

User-level

Core 0 Core 1

• [SIGCOMM’10] PacketShader: A GPU-accelerated software router, http://shader.kaist.edu/packetshader/io_engine/index.html

Page 11: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

1. Thread Model: Pairwise, Per-core Threading

11

User-level packet I/O library (PSIO)

mTCP thread 0 mTCP thread 1

ApplicationThread 0

ApplicationThread 1

mTCP socket mTCP epoll

Device driver

Core 0 Core 1

Symmetric Receive-Side Scaling (H/W)

Per-core packet queue

Per-core listening queue

Per-core file descriptor

Kernel-level

User-level

Page 12: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

From System Call to Context Switching

12

Packet I/O

Kernel TCP

Application thread

BSD socket LInux epoll

User-level packet I/O library

mTCP thread

Application Thread

NIC device driver

mTCP socket mTCP epollKernel

User

Linux TCP mTCP

System call Context switching

Page 13: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

From System Call to Context Switching

13

Packet I/O

Kernel TCP

Application thread

BSD socket LInux epoll

User-level packet I/O library

mTCP thread

Application Thread

NIC device driver

mTCP socket mTCP epollKernel

User

Linux TCP mTCP

System call Context switching

higher overhead

<Batching to amortizecontext switch overhead

Page 14: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

2. Batching process in mTCP thread

Acceptqueue

S S/A F/A

Data list

ACK list

Control list

TX manager

Connectqueue

Application thread

SYN RST FIN

Data list

Writequeue

Closequeue

write()

Rx manager

Payload handler

Socket API

Internal event queue

Eventqueue

ACK list

Control list

TX manager

connect() close()epoll_wait()accept()

SYN ACK

Data

14Rx queue Tx queue

mTCP thread

Page 15: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

3. mTCP API: Similar to BSD Socket API

• Two goals: Easy porting + keeping popular event model

• Ease of porting

– Just attach “mtcp_” to BSD socket API

– socket() mtcp_socket(), accept() mtcp_accept(), etc.

• Event notification: Readiness model using epoll()

• Porting existing applications

– Mostly less than 100 lines of code change

15

Application Description Modified lines / Total lines

Lighttpd An event-driven web server 65 / 40K

ApacheBench A webserver performance benchmark tool 29 / 66K

SSLShader A GPU-accelerated SSL proxy [NSDI ’11] 43 / 6,618

WebReplay A web log replayer 81 / 3,366

Page 16: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Optimizations for Performance

16

• Lock-free data structures

• Cache-friendly data structure

• Hugepages for preventing TLB missing

• Efficient TCP timer management

• Priority-based packet queuing

• Lightweight connection setup

• ……

Please refer to our paper

Page 17: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

mTCP Implementation

17

• 11,473 lines (C code)

• Packet I/O, TCP flow management, User-level socket API, Event system library

• 552 lines to patch the PSIO library

• Support event-driven packet I/O: ps_select()

• TCP implementation

• Follows RFC793

• Congestion control algorithm: NewReno

• Passing correctness test and stress test with Linux TCP stack

Page 18: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Evaluation

18

• Scalability with multicore

• Comparison of performance of multicore with previous solutions

• Performance improvement on ported applications

• Web Server (Lighttpd) – Performance under the real workload

• SSL proxy (SSL Shader, NSDI 11) – TCP bottlenecked application

Page 19: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

0

3

6

9

12

15

0 2 4 6 8

Tran

sact

ion

s/se

c (x

10

5 )

Number of CPU Cores

Linux REUSEPORT MegaPipe mTCP

1

Multicore Scalability

• 64B ping/pong messages per connection• Heavy connection overhead, small packet processing overhead• 25x Linux, 5x SO_REUSEPORT*[LINUX3.9], 3x MegaPipe*[OSDI’12]

19

Shared fd in process

Shared listen socket

* [LINUX3.9] https://lwn.net/Articles/542629/* [OSDI’12] MegaPipe: A New Programming Interface for Scalable Network I/O, Berkeley

Inefficient small packet processing in Kernel

0

3

6

9

12

15

0

3

6

9

12

15

0

3

6

9

12

15

0

3

6

9

12

15 Linux: 3.10.12Intel Xeon E5-269032GB RAMIntel 10Gbps NIC

Page 20: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Performance Improvement on Ported Applications

Web Server (Lighttpd)• Real traffic workload: Static file

workload from SpecWeb2009 set

• 3.2x faster than Linux

• 1.5x faster than MegaPipe

SSL Proxy (SSLShader)• Performance Bottleneck in TCP

• Cipher suite1024-bit RSA, 128-bit AES, HMAC-SHA1

• Download 1-byte object via HTTPS

• 18% ~ 33% better on SSL handshake

20

1.24 1.79

2.69

4.02

0

1

2

3

4

5

Linux REUSEPORT MegaPipe mTCP

Thro

ugh

pu

t (G

bp

s) 26,762 28,208 27,72531,710

36,505 37,739

05

10152025303540

4K 8K 16K

Tran

sact

ion

s/se

c (x

10

3)

# Concurrent Flows

Linux

mTCP

Page 21: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Conclusion

• mTCP: A high-performing user-level TCP stack for multicore systems– Clean-slate user-level design to overcome inefficiency in kernel

• Make full use of extreme parallelism & batch processing– Per-core resource management

– Lock-free data structures & cache-aware threading

– Eliminate system call overhead

– Reduce context switch cost by event batching

• Achieve high performance scalability– Small message transactions: 3x to 25x better

– Existing applications: 33% (SSLShader) to 320% (lighttpd)

21

Page 22: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systemsnotav/mtcp_slides.pdf · 2014-04-13 · • mTCP: A high-performing user-level TCP stack for multicore systems –Clean-slate

Thank You

Source code is available at http://shader.kaist.edu/mtcp/

https://github.com/eunyoung14/mtcp