Scalable End-to-End Data I/O over Enterprise and Data-Center Networks
Yufei Ren (now with IBM T. J. Watson Research Center; graduated from Stony Brook University)
Prof. Dantong Yu (Advisor), BNL & Stony Brook University
March 16th, 2016
2015 SPEC Distinguished Dissertation Award
Data I/O on Modern Hardware
• Bulk data movement
  • 100 Gbps high throughput
  • Across WANs with long latency
• Storage data caching
• Advanced hardware
  • Zero-copy networking
  • Large-scale asymmetric memory layout
• Existing solutions
  • TCP/IP based
  • GridFTP, OpenSSH
  • iSCSI, NFS
[Figure: end-to-end data path spanning a WAN and a SAN, connecting the Argonne, Brookhaven, and Oak Ridge data centers and NERSC; each site hosts a network/secure-app group and a computation group.]
[Figure: modern deep-learning and visualization workloads across enterprise, data-center, automotive, gaming, and professional-visualization segments, with frameworks including Watson, Chainer, Theano, MatConvNet, TensorFlow, CNTK, Torch, and Caffe.]
Research Challenges
• Achieve zero-copy in the end-to-end data path
• Scale zero-copy based network transmission in WANs
• Improve data locality in large-scale NUMA systems
Our Goal: Build systems for scalable end-to-end data I/O and understand system performance characteristics
• Efficient bulk data movement in LAN and WAN
• Effective cache design for large-scale memory systems
• Performance evaluation with advanced hardware
Thesis Outline
• Software design (Ch3)
• RDMA-based data transfer protocol (Ch4)
• Storage caching optimization (Ch5)
• End-to-end performance optimization (Ch6)
[Figure: thesis outline (Ch3-Ch6) overlaid on the end-to-end data path across the WAN, the SAN, the network/secure-app and computation groups, and the Argonne, Brookhaven, and Oak Ridge data centers and NERSC.]
End-to-End Asynchronous I/O: Publications
• Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, “End-to-End Asynchronous Processing for High Throughput Computing”, under revision.
• [SC’13] Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, “Design and Performance Evaluation of NUMA-Aware RDMA-Based End-to-End Data Transfer Systems”, In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’13), Denver, Colorado, November 2013.
• [JSS’13] Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, “Design and Testbed Evaluation of RDMA-based Middleware for High-performance Data Transfer Applications”, Journal of Systems and Software (JSS), Volume 86, Issue 7, July 2013, Pages 1850-1863, ISSN 0164-1212, 10.1016/j.jss.2013.01.070.
• [SC’12] Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, Brian L. Tierney, Eric Pouyoul, “Protocols for Wide-Area Data-intensive Applications: Design and Performance Issues”, In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’12), Salt Lake City, Utah, November 2012.
Classical Software Design
• OpenSSH & HPN-SCP
  • Heavily rely on data-copy based OS services
  • Inter-process communication
  • TCP-based data transfer
  • Massive interrupt handling
  • Synchronous and blocking I/O
• Hardware performance
  • Network: 80 Gbps
  • Storage: 100 Gbps
  • CPU AES encryption: 142 Gbps
[Figure: OpenSSH scp/sftp architecture: the client-side scp/sftp forks ssh over a pipe; ssh talks to sshd through a single-stream socket connection; sshd in turn forks scp/sftp over a pipe on the server.]
[Figure: secure data movement (AES-CBC-128bit): throughput (Gbps, up to 80) vs. application and number of parallel processes.]
Network Transmission Profile
• 3 x 40 Gbps RoCE links
• Memory bandwidth: 400 Gbps
• NUMA random pick: 83.5 Gbps
• NUMA-aware: 10% improvement
• TCP was not able to saturate this fat link
• The TCP benchmark takes a shortcut
  • User-space memory -> CPU cache -> kernel-space memory -> DMA to NIC
• TCP: 35% of CPU time is spent on memory copy
  • copy_user_generic_string()
• RDMA achieved 98% of bare-metal throughput
[Figure: throughput over the 3 x 40 Gbps RoCE links: iperf (TCP) reaches 91.8 Gbps (83.5 Gbps with random NUMA placement), while iperf-rdma (RDMA) reaches 117.6 Gbps.]
Related work
• High-performance TCP-based solutions
  • GridFTP
  • bbcp
• RDMA send/receive for bulk data movement
  • Lai et al., ICPP ’09
• RDMA write based solutions
  • Tian et al., Journal of Comp & Elec Eng ’12
  • One RTT; not scalable to the WAN
• Shared control channel and data channel
  • Not studied in a WAN environment
Overarching Goal
• Efficient secure data transfer
  • Storage I/O, computing, network I/O
• Orchestrate hardware and retain zero-copy
• Observations on hardware trends
  • Network: two orders of magnitude improvement (1 Gbps to 100 Gbps with a single port)
  • Storage: one order of magnitude improvement (200 MB/s disk to 2 GB/s flash)
  • Multi-core: 18 cores per die x 8 sockets
  • Memory capacity and throughput improvement (200 Gbps on a NUMA node)
• Memory-centric design
ACES Framework
• Asynchronous, Concurrent, Event-driven, Staged
• End-to-end zero-copy
[Figure: ACES staged pipeline at the data source and data sink: storage I/O, data integrity, data security, and network I/O stages connected by task queues and mapped onto disks, CPUs/GPUs, and NICs, with a monitor thread on each side for accounting and thread adjustment.]
Asynchronous Storage I/O
• Asynchronous I/O separates I/O submission from I/O completion
• libaio: the Linux asynchronous I/O library (a minimal sketch follows the diagram below)
  • io_submit()
  • io_getevents()
[Figure: storage I/O thread sitting between an incoming queue and an outgoing queue: new tasks are prepared and submitted; completed I/Os are polled and posted to the next stage.]
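Below is a minimal, self-contained libaio sketch of this submit/poll split (the file name and sizes are illustrative, not from the dissertation); build with -laio.

    /* Minimal libaio sketch: submit one read asynchronously with io_submit(),
     * then reap its completion with io_getevents(). */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        io_context_t ctx = 0;
        int ret = io_setup(32, &ctx);                 /* room for 32 in-flight I/Os */
        if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

        int fd = open("input.dat", O_RDONLY);         /* illustrative file name */
        if (fd < 0) { perror("open"); return 1; }

        void *buf = NULL;
        if (posix_memalign(&buf, 4096, 1 << 20)) return 1;   /* 1 MB aligned buffer */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 1 << 20, 0);      /* read 1 MB at offset 0 */
        if (io_submit(ctx, 1, cbs) != 1) return 1;    /* submission returns immediately */

        /* ... the thread is free to prepare or submit more I/O here ... */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);           /* poll (block) for one completion */
        printf("read %ld bytes\n", (long)ev.res);

        io_destroy(ctx);
        return 0;
    }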
Parallel Computing
• Multi-threaded parallelism
• Encryption/decryption
  • Advanced Encryption Standard (AES)
  • Session-based (i.e., per single file)
• Data integrity
  • Block-based
  • crc32, Adler32 (see the sketch after the diagram below)
• Adjust the number of threads dynamically
[Figure: a pool of computing threads consuming from an incoming queue and posting to an outgoing queue.]
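As a hedged illustration of the block-based integrity checks named above, here is a small sketch using zlib's crc32() and adler32() on a single block (the 64 KB block size is illustrative); build with -lz.

    /* Block-based integrity sketch: compute crc32 and Adler-32 over one
     * fixed-size block so that blocks can be checked independently by
     * parallel computing threads. */
    #include <zlib.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned char block[64 * 1024];               /* one 64 KB data block */
        memset(block, 0x5a, sizeof(block));           /* stand-in payload */

        uLong crc = crc32(0L, Z_NULL, 0);             /* seed value */
        crc = crc32(crc, block, sizeof(block));

        uLong adl = adler32(0L, Z_NULL, 0);           /* seed value */
        adl = adler32(adl, block, sizeof(block));

        printf("crc32=%08lx adler32=%08lx\n", crc, adl);
        return 0;
    }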
Zero-copy Network I/O
• High throughput
  • Achieve line speed
• Low cost
  • CPU utilization
  • Memory footprint
• Scalability
  • 100 Gbps and beyond
  • Long-latency WAN
• Network independent
  • InfiniBand: high I/O depth on a single QP
  • RoCE
  • iWARP: multiple QPs
• Verbs
  • “Verbs programming is like the assembly language version of network programming.”
Network Protocol Overview
• Design choices (a minimal verbs sketch follows the protocol figure below)
  • One-sided RDMA for bulk data movement (credit-based)
  • Two-sided send/receive for control messages (send with inline)
  • Asynchronous protocol to hide the latency impact
  • Multiple reliable queue pairs for data transfer
  • Proactive feedback
[Figure: RFTP protocol overview: the data source loads data and the data sink offloads data; a control-message QP carries get_free_blk/put_ready_blk (source side) and put_free_blk/get_ready_blk (sink side) messages, while the payload moves over dedicated bulk-data-transfer QPs.]
RFTP: RDMA-enabled FTP Service
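To make the one-sided design concrete, here is a minimal verbs sketch (not RFTP's actual code) that posts a single RDMA WRITE for one data block; it assumes an already-connected RC queue pair, a registered memory region covering buf, and a remote address/rkey learned from a control message.

    /* One-sided bulk transfer sketch: post an RDMA WRITE work request.
     * The sink never posts a receive for the payload; completion is
     * signaled locally so credits can be returned over the control QP. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                        uint32_t len, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = (uintptr_t)buf;      /* identifies the block at completion */
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided operation */
        wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a CQE for this WR */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr;         /* learned from the control message */
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
    }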
Implementation: Software Architecture
• Data structures
  • Memory management
  • Control message management
  • Credit management
  • Connection management
• Threads
  • Multi-threaded
  • Master-worker
  • Event-driven
  • Polling and dispatch
[Figure: RFTP software architecture (sender side): application-level data structures (data block list, send/receive control message lists, remote MR info list, queue pair list) and threads (CE dispatcher, CE workers, logger) layered above system memory, the completion queue and QPs, and the RNIC hardware; numbered steps 1-4 trace how a block of user payload is received.]
LAN Evaluation
[Figure: LAN testbed: HP DL380 and IBM x3650 hosts with Mellanox ConnectX-3 VPI adapters, connected through a Mellanox SX6018 InfiniBand switch (56 Gbps FDR) and a Mellanox SX1036 Ethernet switch (40 Gbps QDR); applications under test: RFTP, OpenSSH SCP, and HPN-SCP.]
Secure Bulk Data Movement
[Figure: secure (AES-128bit) end-to-end data movement: throughput (Gbps) and CPU utilization (%) at the data source and data sink vs. application and number of parallel processes.]
RFTP vs. GridFTP
[Figure: RFTP vs. GridFTP over a 25-minute run: bandwidth comparison (Gbps) against the storage threshold, and CPU utilization (%, user/sys/wait) at the RFTP and GridFTP source and sink, annotated with 3x and 6x differences.]
40 Gbps Long Distance WAN Testbed
• 40 Gbps RoCE WAN
• 4,000-mile loopback
• RTT: 95 milliseconds
• BDP: 500 MB (roughly 40 Gbps x 95 ms ≈ 475 MB)
• Will RFTP be scalable in the WAN?
[Figure: WAN loopback between Oakland, CA and Chicago, IL; bandwidth (Gbps, 35 to 40) vs. block size (1 MB to 16 MB) for 1, 2, 4, 8, and 16 streams.]
NUMA-aware Cache for iSCSI Servers
Publications
• [TPDS’15] Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, "Design, Implementation, and Evaluation of a NUMA-Aware Cache for iSCSI Storage Servers", IEEE Transactions on Parallel & Distributed Systems (TPDS), vol. 26, no. 2, pp. 413-422, Feb. 2015.
SAN Overview
• Large-scale asymmetric memory caching system
• Massive I/O event processing
• iSCSI/iSER: SCSI commands and data over TCP/IP and RDMA
• Cache: storage performance acceleration
[Figure: SAN overview: an iSCSI initiator connects to a target running the iSCSI service, which is backed by a cache and massive I/O event processing.]
Related work: Caching in SAN
[Figure: taxonomy of related SAN caching work]
• Replacement algorithms: Johnson, VLDB 1994; Megiddo, FAST 2003
• Collaborative cache management: aggressive-collaborative (Wong, ATC 2002); hierarchy-aware (Zhou, TPDS 2004)
• Architecture-aware optimization: NUMA-aware cache (this work)
Caching in a NUMA host
[Figure: memcpy(dst, src, len) on a four-node NUMA host: with NUMA-unaware placement (thread, src, and dst spread across nodes) STREAM measures 3.3 GB/s; with NUMA-aware placement (thread and both buffers on the same node) it reaches 18.9 GB/s.]
• [ICPP’13] Tan Li, Yufei Ren, Dantong Yu, Shudong Jin, Thomas Robertazzi, “Characterization of Input/Output Bandwidth Performance Models in NUMA Architecture for Data Intensive Applications”, In Proceedings of the International Conference on Parallel Processing (ICPP ’13), Lyon, France, October 2013.
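A minimal sketch of the NUMA-aware placement idea using libnuma (the buffer size is illustrative and this is not the dissertation's code): it keeps the copying thread on node 0 and allocates both buffers from node-0 memory so memcpy() hits local DRAM; build with -lnuma.

    /* NUMA-aware placement sketch: keep the thread and both memcpy()
     * buffers on the same NUMA node to avoid remote-memory bandwidth. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        size_t len = 1UL << 30;                        /* 1 GB, illustrative */

        numa_run_on_node(0);                           /* bind the thread to node 0 */
        char *src = numa_alloc_onnode(len, 0);         /* node-0 local memory */
        char *dst = numa_alloc_onnode(len, 0);
        if (!src || !dst) { fprintf(stderr, "allocation failed\n"); return 1; }

        memset(src, 0xab, len);                        /* fault pages in before copying */
        memcpy(dst, src, len);                         /* local-to-local copy */

        numa_free(src, len);
        numa_free(dst, len);
        return 0;
    }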
iSCSI/iSER Performance Modeling for Cached Data
• Latency (from the initiator's perspective):

  Latency = RTT + IOSize / MemBW + IOSize / NetworkBW + QueuingDelay

• Bandwidth:

  BW_CachedData = min(BW_memory, BW_network)

  MemBW_CachedData = Σ_{i=0}^{n-1} Weight_i · MemBW_i

  (When NetworkBW << MemoryBW, the network is the clear bottleneck; with RDMA the gap narrows to NetworkBW < MemoryBW, so NUMA memory bandwidth starts to matter.)
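As an illustrative application of this model (assumed all-remote and all-local placements; 18.71 GB/s and 3.26 GB/s are the local and remote STREAM bandwidths reported for the Dell R820 below, and 56 Gbps FDR is roughly 7 GB/s):

    All cached blocks remote:  MemBW_CachedData ≈ 1.0 × 3.26 GB/s  ≈ 3.3 GB/s  → min(3.3, ~7) GB/s: memory-bound
    All cached blocks local:   MemBW_CachedData ≈ 1.0 × 18.71 GB/s ≈ 18.7 GB/s → min(18.7, ~7) GB/s: network-bound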
NUMA-aware Cache for SAN
• Allocate threads, the cache, and network buffers within each NUMA node
[Figure: NUMA-aware cache architecture: a master thread dispatches to per-node groups (node-0, node-1, ...), each with its own worker threads, network buffers, and NUMA-aware cache partition.]
Evaluation
• 4-node NUMA host: Dell R820
  • Intel Xeon E5-4620 with QPI
  • 768 GB RAM
  • Local memory bandwidth: 18.71 GB/s
  • Remote memory bandwidth: 3.26 GB/s
• 4 initiators
  • Network adapters: 56 Gbps InfiniBand FDR
• OS page cache vs. NUMA-aware cache
  • Fully cached data access
  • Benchmark: fio with random read I/Os
[Figure: evaluation testbed: four initiators (two IBM x3650 and two HP DL380 hosts) connected through a Mellanox FDR InfiniBand switch to the four-node (node0-node3) Dell R820 target.]
Latency (from the initiator's perspective)
• Prompt response
• Overhead vs. NUMA-aware benefit
[Figure: latency (us) vs. request size (4 KB to 512 KB), NUMA-aware cache vs. OS page cache: for small requests the overhead exceeds the benefit, the two break even at intermediate sizes, and for large requests the benefit dominates, up to a 23% latency reduction.]
Throughput Comparison
• Concurrent request serving
• Aggregate throughput
[Figure: throughput (MB/s) and CPU utilization (%) vs. I/O request size (16 KB to 1 MB), NUMA-aware cache vs. OS page cache: up to 80% higher throughput while saving about 3 CPU cores.]
Throughput Comparison
• The master thread becomes the bottleneck
[Figure: the same throughput and CPU-utilization comparison, annotated with the memory-bottleneck and network-bottleneck regions, the 80% throughput gain, and the 3-CPU-core saving.]
Decentralized Event Processing
• Standard iSCSI: centralized event processing with a single thread
• Create multiple threads and distribute event processing across them (see the sketch after the diagram below)
• The number of events to process is about 3x the number of requests
[Figure: decentralized event processing in the target: initiator connection requests arrive at a master thread, which distributes work to multiple I/O threads and an ack helper thread through a scheduling-events list, an I/O-request list, a finished list, and a notification pipe.]
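Below is a hedged sketch of the decentralized idea (not tgtd's actual code): each I/O thread owns its own epoll instance and event loop, so event processing is spread across threads instead of funneling through one master thread; build with -pthread.

    /* Decentralized event-processing sketch: one epoll instance and one
     * event loop per I/O thread. The acceptor (not shown) would register
     * each new initiator connection with one of the per-thread epoll fds. */
    #include <pthread.h>
    #include <sys/epoll.h>

    #define NUM_IO_THREADS 4
    #define MAX_EVENTS 64

    static void *io_thread(void *arg)
    {
        int epfd = *(int *)arg;                       /* this thread's epoll instance */
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                (void)fd;                             /* handle the request on this fd */
            }
        }
        return NULL;
    }

    int main(void)
    {
        static int epfds[NUM_IO_THREADS];
        pthread_t tids[NUM_IO_THREADS];

        for (int i = 0; i < NUM_IO_THREADS; i++) {
            epfds[i] = epoll_create1(0);
            pthread_create(&tids[i], NULL, io_thread, &epfds[i]);
        }
        /* New connections would be spread across threads, e.g.:
         * epoll_ctl(epfds[conn_id % NUM_IO_THREADS], EPOLL_CTL_ADD, fd, &ev); */
        for (int i = 0; i < NUM_IO_THREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }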
Event Processing Scalability
[Figure: IOPS with 512-byte small requests vs. number of concurrent clients (1 to 64): the decentralized design scales to roughly 1 million IOPS, far above the standard single-threaded design.]
Real-life workloads
• YCSB – MongoDB – iSER caching and storage
[Figure: YCSB driving MongoDB (co-located on localhost), whose storage is served over InfiniBand by tgtd with either the NUMA-aware cache or the OS page cache in front of the disks: throughput (transactions/sec) at read-to-update ratios of 9:1, 8:2, and 5:5, with gains of 12.8%, 3.5%, and 0%, respectively.]
Conclusions
• ACES (Ch3): orchestrate resources on a memory-centric architecture (under revision)
• RFTP (Ch4): scalable zero-copy based protocol for the WAN [SC12, JSS13, HPGC12]
• NUMA-aware Cache (Ch5): improve SAN caching data-operation locality [TPDS15]
• End-to-End Optimization (Ch6): extract line-speed I/O performance along the end-to-end data path [SC13]
Software Contributions
• RFTP: RDMA-based FTP software
  • http://ftp100.cewit.stonybrook.edu/rftp
  • The three-clause BSD license
  • User: High Energy Physics at Caltech
• RDMA benchmarks
  • fio: RDMA engine and NUMA module
    • Chelsio: http://www.chelsio.com/wp-content/uploads/resources/Lustre-Over-iWARP-vs-IB-FDR.pdf
    • SanDisk: http://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf
  • iperf-rdma
    • https://github.com/yufeiren/iperf-rdma
    • China MinSheng Bank
• NUMA-aware cache for iSCSI/iSER
  • Linux SCSI Target Framework (tgtd)
  • https://bitbucket.org/dtyu/tgt
Thank you Q & A