Scalable End-to-End Data I/O over Enterprise and Data-Center Networks
Yufei Ren (now with IBM T. J. Watson Research Center; graduated from Stony Brook University)
Prof. Dantong Yu (Advisor), BNL & Stony Brook University
March 16th, 2016
2015 SPEC Distinguished Dissertation Award
Data I/O on Modern Hardware
• Bulk data movement
  • 100 Gbps high throughput
  • Across WANs with long latency
• Storage data caching
• Advanced hardware
  • Zero-copy networking
  • Large-scale asymmetric memory layout
• Existing solutions
  • TCP/IP based
  • GridFTP, OpenSSH
  • iSCSI, NFS
[Figure: end-to-end data path spanning a WAN and a SAN, connecting the Argonne, Brookhaven, and Oak Ridge data centers and NERSC; each site hosts a network/secure-app group and a computation group.]
[Figure: modern deep-learning and visualization workloads across enterprise, data-center, automotive, gaming, and professional-visualization segments, with frameworks including Watson, Chainer, Theano, MatConvNet, TensorFlow, CNTK, Torch, and Caffe.]
Research Challenges
• Achieve zero-copy in the end-to-end data path
• Scale zero-copy based network transmission in WANs
• Improve data locality in large-scale NUMA systems
Our Goal: Build systems for scalable end-to-end data I/O and understand system performance characteristics
• Efficient bulk data movement in LAN and WAN
• Effective cache design for large-scale memory systems
• Performance evaluation with advanced hardware
Thesis Outline
• Software design (Ch3)
• RDMA-based data transfer protocol (Ch4)
• Storage caching optimization (Ch5)
• End-to-end performance optimization (Ch6)
[Figure: thesis outline (Ch3-Ch6) overlaid on the end-to-end data path across the WAN, the SAN, the network/secure-app and computation groups, and the Argonne, Brookhaven, and Oak Ridge data centers and NERSC.]
End-to-End Asynchronous I/O: Publications
• Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, “End-to-End Asynchronous Processing for High Throughput Computing”, under revision.
• [SC’13] Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, “Design and Performance Evaluation of NUMA-Aware RDMA-Based End-to-End Data Transfer Systems”, In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’13), Denver, Colorado, November 2013.
• [JSS’13] Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, “Design and Testbed Evaluation of RDMA-based Middleware for High-performance Data Transfer Applications”, Journal of Systems and Software (JSS), Volume 86, Issue 7, July 2013, Pages 1850-1863, ISSN 0164-1212, 10.1016/j.jss.2013.01.070.
• [SC’12] Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, Brian L. Tierney, Eric Pouyoul, “Protocols for Wide-Area Data-intensive Applications: Design and Performance Issues”, In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’12), Salt Lake City, Utah, November 2012.
Classical Software Design
• OpenSSH & HPN-SCP
  • Heavily rely on data-copy based OS services
  • Inter-process communication
  • TCP-based data transfer
  • Massive interrupt handling
  • Synchronous and blocking I/O
• Hardware performance
  • Network: 80 Gbps
  • Storage: 100 Gbps
  • CPU AES encryption: 142 Gbps
[Figure: OpenSSH scp/sftp architecture: the client-side scp/sftp forks ssh over a pipe; ssh talks to sshd through a single-stream socket connection; sshd in turn forks scp/sftp over a pipe on the server.]
[Figure: secure data movement (AES-CBC-128bit): throughput (Gbps, up to 80) vs. application and number of parallel processes.]
Network Transmission Profile
• 3 x 40 Gbps RoCE links
• Memory bandwidth: 400 Gbps
• NUMA random pick: 83.5 Gbps
• NUMA-aware: 10% improvement
• TCP was not able to saturate this fat link
• The TCP benchmark takes a shortcut
  • User-space memory -> CPU cache -> kernel-space memory -> DMA to NIC
• TCP: 35% of CPU time is spent on memory copy
  • copy_user_generic_string()
• RDMA achieved 98% of bare-metal throughput
[Figure: throughput over the 3 x 40 Gbps RoCE links: iperf (TCP) reaches 91.8 Gbps (83.5 Gbps with random NUMA placement), while iperf-rdma (RDMA) reaches 117.6 Gbps.]
Related work
• High-performance TCP-based solutions
  • GridFTP
  • bbcp
• RDMA send/receive for bulk data movement
  • Lai et al., ICPP ’09
• RDMA write based solutions
  • Tian et al., Journal of Comp & Elec Eng ’12
  • One RTT; not scalable to the WAN
• Shared control channel and data channel
  • Not studied in a WAN environment
Overarching Goal
• Efficient secure data transfer
  • Storage I/O, computing, network I/O
• Orchestrate hardware and retain zero-copy
• Observations on hardware trends
  • Network: two orders of magnitude improvement (1 Gbps to 100 Gbps with a single port)
  • Storage: one order of magnitude improvement (200 MB/s disk to 2 GB/s flash)
  • Multi-core: 18 cores per die x 8 sockets
  • Memory capacity and throughput improvement (200 Gbps on a NUMA node)
• Memory-centric design
ACES Framework
• Asynchronous, Concurrent, Event-driven, Staged
• End-to-end zero-copy
[Figure: ACES staged pipeline at the data source and data sink: storage I/O, data integrity, data security, and network I/O stages connected by task queues and mapped onto disks, CPUs/GPUs, and NICs, with a monitor thread on each side for accounting and thread adjustment.]
Asynchronous Storage I/O
• Asynchronous I/O separates I/O submission from I/O completion
• libaio: the Linux asynchronous I/O library (a minimal sketch follows the diagram below)
  • io_submit()
  • io_getevents()
[Figure: storage I/O thread sitting between an incoming queue and an outgoing queue: new tasks are prepared and submitted; completed I/Os are polled and posted to the next stage.]
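Below is a minimal, self-contained libaio sketch of this submit/poll split (the file name and sizes are illustrative, not from the dissertation); build with -laio.

    /* Minimal libaio sketch: submit one read asynchronously with io_submit(),
     * then reap its completion with io_getevents(). */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        io_context_t ctx = 0;
        int ret = io_setup(32, &ctx);                 /* room for 32 in-flight I/Os */
        if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

        int fd = open("input.dat", O_RDONLY);         /* illustrative file name */
        if (fd < 0) { perror("open"); return 1; }

        void *buf = NULL;
        if (posix_memalign(&buf, 4096, 1 << 20)) return 1;   /* 1 MB aligned buffer */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 1 << 20, 0);      /* read 1 MB at offset 0 */
        if (io_submit(ctx, 1, cbs) != 1) return 1;    /* submission returns immediately */

        /* ... the thread is free to prepare or submit more I/O here ... */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);           /* poll (block) for one completion */
        printf("read %ld bytes\n", (long)ev.res);

        io_destroy(ctx);
        return 0;
    }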
Parallel Computing
• Multi-threaded parallelism
• Encryption/decryption
  • Advanced Encryption Standard (AES)
  • Session-based (i.e., per single file)
• Data integrity
  • Block-based
  • crc32, Adler32 (see the sketch after the diagram below)
• Adjust the number of threads dynamically
[Figure: a pool of computing threads consuming from an incoming queue and posting to an outgoing queue.]
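As a hedged illustration of the block-based integrity checks named above, here is a small sketch using zlib's crc32() and adler32() on a single block (the 64 KB block size is illustrative); build with -lz.

    /* Block-based integrity sketch: compute crc32 and Adler-32 over one
     * fixed-size block so that blocks can be checked independently by
     * parallel computing threads. */
    #include <zlib.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned char block[64 * 1024];               /* one 64 KB data block */
        memset(block, 0x5a, sizeof(block));           /* stand-in payload */

        uLong crc = crc32(0L, Z_NULL, 0);             /* seed value */
        crc = crc32(crc, block, sizeof(block));

        uLong adl = adler32(0L, Z_NULL, 0);           /* seed value */
        adl = adler32(adl, block, sizeof(block));

        printf("crc32=%08lx adler32=%08lx\n", crc, adl);
        return 0;
    }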
Zero-copy Network I/O
• High throughput
  • Achieve line speed
• Low cost
  • CPU utilization
  • Memory footprint
• Scalability
  • 100 Gbps and beyond
  • Long-latency WAN
• Network independent
  • InfiniBand: high I/O depth on a single QP
  • RoCE
  • iWARP: multiple QPs
• Verbs
  • “Verbs programming is like the assembly language version of network programming.”
Network Protocol Overview
• Design choices (a minimal verbs sketch follows the protocol figure below)
  • One-sided RDMA for bulk data movement (credit-based)
  • Two-sided send/receive for control messages (send with inline)
  • Asynchronous protocol to hide the latency impact
  • Multiple reliable queue pairs for data transfer
  • Proactive feedback
[Figure: RFTP protocol overview: the data source loads data and the data sink offloads data; a control-message QP carries get_free_blk/put_ready_blk (source side) and put_free_blk/get_ready_blk (sink side) messages, while the payload moves over dedicated bulk-data-transfer QPs.]
RFTP: RDMA-enabled FTP Service
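To make the one-sided design concrete, here is a minimal verbs sketch (not RFTP's actual code) that posts a single RDMA WRITE for one data block; it assumes an already-connected RC queue pair, a registered memory region covering buf, and a remote address/rkey learned from a control message.

    /* One-sided bulk transfer sketch: post an RDMA WRITE work request.
     * The sink never posts a receive for the payload; completion is
     * signaled locally so credits can be returned over the control QP. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                        uint32_t len, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = (uintptr_t)buf;      /* identifies the block at completion */
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided operation */
        wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a CQE for this WR */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr;         /* learned from the control message */
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
    }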
Implementation: Software Architecture
• Data structures
  • Memory management
  • Control message management
  • Credit management
  • Connection management
• Threads
  • Multi-threaded
  • Master-worker
  • Event-driven
  • Polling and dispatch
[Figure: RFTP software architecture (sender side): application-level data structures (data block list, send/receive control message lists, remote MR info list, queue pair list) and threads (CE dispatcher, CE workers, logger) layered above system memory, the completion queue and QPs, and the RNIC hardware; numbered steps 1-4 trace how a block of user payload is received.]
LAN Evaluation
[Figure: LAN testbed: HP DL380 and IBM x3650 hosts with Mellanox ConnectX-3 VPI adapters, connected through a Mellanox SX6018 InfiniBand switch (56 Gbps FDR) and a Mellanox SX1036 Ethernet switch (40 Gbps QDR); applications under test: RFTP, OpenSSH SCP, and HPN-SCP.]
Secure Bulk Data Movement
[Figure: secure (AES-128bit) end-to-end data movement: throughput (Gbps) and CPU utilization (%) at the data source and data sink vs. application and number of parallel processes.]
RFTP vs. GridFTP
[Figure: RFTP vs. GridFTP over a 25-minute run: bandwidth comparison (Gbps) against the storage threshold, and CPU utilization (%, user/sys/wait) at the RFTP and GridFTP source and sink, annotated with 3x and 6x differences.]
40 Gbps Long Distance WAN Testbed
• 40 Gbps RoCE WAN
• 4,000-mile loopback
• RTT: 95 milliseconds
• BDP: 500 MB (roughly 40 Gbps x 95 ms ≈ 475 MB)
• Will RFTP be scalable in the WAN?
[Figure: WAN loopback between Oakland, CA and Chicago, IL; bandwidth (Gbps, 35 to 40) vs. block size (1 MB to 16 MB) for 1, 2, 4, 8, and 16 streams.]
NUMA-aware Cache for iSCSI Servers
Publications
• [TPDS’15] Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, "Design, Implementation, and Evaluation of a NUMA-Aware Cache for iSCSI Storage Servers", IEEE Transactions on Parallel & Distributed Systems (TPDS), vol. 26, no. 2, pp. 413-422, Feb. 2015.
SAN Overview
• Large-scale asymmetric memory caching system
• Massive I/O event processing
• iSCSI/iSER: SCSI commands and data over TCP/IP and RDMA
• Cache: storage performance acceleration
[Figure: SAN overview: an iSCSI initiator connects to a target running the iSCSI service, which is backed by a cache and massive I/O event processing.]
Related work: Caching in SAN
[Figure: taxonomy of related SAN caching work]
• Replacement algorithms: Johnson, VLDB 1994; Megiddo, FAST 2003
• Collaborative cache management: aggressive-collaborative (Wong, ATC 2002); hierarchy-aware (Zhou, TPDS 2004)
• Architecture-aware optimization: NUMA-aware cache (this work)
Caching in a NUMA host
[Figure: memcpy(dst, src, len) on a four-node NUMA host: with NUMA-unaware placement (thread, src, and dst spread across nodes) STREAM measures 3.3 GB/s; with NUMA-aware placement (thread and both buffers on the same node) it reaches 18.9 GB/s.]
• [ICPP’13] Tan Li, Yufei Ren, Dantong Yu, Shudong Jin, Thomas Robertazzi, “Characterization of Input/Output Bandwidth Performance Models in NUMA Architecture for Data Intensive Applications”, In Proceedings of the International Conference on Parallel Processing (ICPP ’13), Lyon, France, October 2013.
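A minimal sketch of the NUMA-aware placement idea using libnuma (the buffer size is illustrative and this is not the dissertation's code): it keeps the copying thread on node 0 and allocates both buffers from node-0 memory so memcpy() hits local DRAM; build with -lnuma.

    /* NUMA-aware placement sketch: keep the thread and both memcpy()
     * buffers on the same NUMA node to avoid remote-memory bandwidth. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        size_t len = 1UL << 30;                        /* 1 GB, illustrative */

        numa_run_on_node(0);                           /* bind the thread to node 0 */
        char *src = numa_alloc_onnode(len, 0);         /* node-0 local memory */
        char *dst = numa_alloc_onnode(len, 0);
        if (!src || !dst) { fprintf(stderr, "allocation failed\n"); return 1; }

        memset(src, 0xab, len);                        /* fault pages in before copying */
        memcpy(dst, src, len);                         /* local-to-local copy */

        numa_free(src, len);
        numa_free(dst, len);
        return 0;
    }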
iSCSI/iSER Performance Modeling for Cached Data
• Latency (from the initiator's perspective):

  Latency = RTT + IOSize / MemBW + IOSize / NetworkBW + QueuingDelay

• Bandwidth:

  BW_CachedData = min(BW_memory, BW_network)

  MemBW_CachedData = Σ_{i=0}^{n-1} Weight_i · MemBW_i

  (When NetworkBW << MemoryBW, the network is the clear bottleneck; with RDMA the gap narrows to NetworkBW < MemoryBW, so NUMA memory bandwidth starts to matter.)
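As an illustrative application of this model (assumed all-remote and all-local placements; 18.71 GB/s and 3.26 GB/s are the local and remote STREAM bandwidths reported for the Dell R820 below, and 56 Gbps FDR is roughly 7 GB/s):

    All cached blocks remote:  MemBW_CachedData ≈ 1.0 × 3.26 GB/s  ≈ 3.3 GB/s  → min(3.3, ~7) GB/s: memory-bound
    All cached blocks local:   MemBW_CachedData ≈ 1.0 × 18.71 GB/s ≈ 18.7 GB/s → min(18.7, ~7) GB/s: network-bound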
NUMA-aware Cache for SAN
• Allocate threads, the cache, and network buffers within each NUMA node
[Figure: NUMA-aware cache architecture: a master thread dispatches to per-node groups (node-0, node-1, ...), each with its own worker threads, network buffers, and NUMA-aware cache partition.]
Evaluation
• 4-node NUMA host: Dell R820
  • Intel Xeon E5-4620 with QPI
  • 768 GB RAM
  • Local memory bandwidth: 18.71 GB/s
  • Remote memory bandwidth: 3.26 GB/s
• 4 initiators
  • Network adapters: 56 Gbps InfiniBand FDR
• OS page cache vs. NUMA-aware cache
  • Fully cached data access
  • Benchmark: fio with random read I/Os
[Figure: evaluation testbed: four initiators (two IBM x3650 and two HP DL380 hosts) connected through a Mellanox FDR InfiniBand switch to the four-node (node0-node3) Dell R820 target.]
Latency (from the initiator's perspective)
• Prompt response
• Overhead vs. NUMA-aware benefit
[Figure: latency (us) vs. request size (4 KB to 512 KB), NUMA-aware cache vs. OS page cache: for small requests the overhead exceeds the benefit, the two break even at intermediate sizes, and for large requests the benefit dominates, up to a 23% latency reduction.]
Throughput Comparison
• Concurrent request serving
• Aggregate throughput
[Figure: throughput (MB/s) and CPU utilization (%) vs. I/O request size (16 KB to 1 MB), NUMA-aware cache vs. OS page cache: up to 80% higher throughput while saving about 3 CPU cores.]
Throughput Comparison
• The master thread becomes the bottleneck
[Figure: the same throughput and CPU-utilization comparison, annotated with the memory-bottleneck and network-bottleneck regions, the 80% throughput gain, and the 3-CPU-core saving.]
Decentralized Event Processing
• Standard iSCSI: centralized event processing with a single thread
• Create multiple threads and distribute event processing across them (see the sketch after the diagram below)
• The number of events to process is about 3x the number of requests
[Figure: decentralized event processing in the target: initiator connection requests arrive at a master thread, which distributes work to multiple I/O threads and an ack helper thread through a scheduling-events list, an I/O-request list, a finished list, and a notification pipe.]
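Below is a hedged sketch of the decentralized idea (not tgtd's actual code): each I/O thread owns its own epoll instance and event loop, so event processing is spread across threads instead of funneling through one master thread; build with -pthread.

    /* Decentralized event-processing sketch: one epoll instance and one
     * event loop per I/O thread. The acceptor (not shown) would register
     * each new initiator connection with one of the per-thread epoll fds. */
    #include <pthread.h>
    #include <sys/epoll.h>

    #define NUM_IO_THREADS 4
    #define MAX_EVENTS 64

    static void *io_thread(void *arg)
    {
        int epfd = *(int *)arg;                       /* this thread's epoll instance */
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                (void)fd;                             /* handle the request on this fd */
            }
        }
        return NULL;
    }

    int main(void)
    {
        static int epfds[NUM_IO_THREADS];
        pthread_t tids[NUM_IO_THREADS];

        for (int i = 0; i < NUM_IO_THREADS; i++) {
            epfds[i] = epoll_create1(0);
            pthread_create(&tids[i], NULL, io_thread, &epfds[i]);
        }
        /* New connections would be spread across threads, e.g.:
         * epoll_ctl(epfds[conn_id % NUM_IO_THREADS], EPOLL_CTL_ADD, fd, &ev); */
        for (int i = 0; i < NUM_IO_THREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }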
Event Processing Scalability
[Figure: IOPS with 512-byte small requests vs. number of concurrent clients (1 to 64): the decentralized design scales to roughly 1 million IOPS, far above the standard single-threaded design.]
Real-life workloads
• YCSB – MongoDB – iSER caching and storage
[Figure: YCSB driving MongoDB (co-located on localhost), whose storage is served over InfiniBand by tgtd with either the NUMA-aware cache or the OS page cache in front of the disks: throughput (transactions/sec) at read-to-update ratios of 9:1, 8:2, and 5:5, with gains of 12.8%, 3.5%, and 0%, respectively.]
Conclusions
• ACES (Ch3): orchestrate resources on a memory-centric architecture (under revision)
• RFTP (Ch4): scalable zero-copy based protocol for the WAN [SC12, JSS13, HPGC12]
• NUMA-aware Cache (Ch5): improve SAN caching data-operation locality [TPDS15]
• End-to-End Optimization (Ch6): extract line-speed I/O performance along the end-to-end data path [SC13]
Software Contributions
• RFTP: RDMA-based FTP software
  • http://ftp100.cewit.stonybrook.edu/rftp
  • The three-clause BSD license
  • User: High Energy Physics at Caltech
• RDMA benchmarks
  • fio: RDMA engine and NUMA module
    • Chelsio: http://www.chelsio.com/wp-content/uploads/resources/Lustre-Over-iWARP-vs-IB-FDR.pdf
    • SanDisk: http://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf
  • iperf-rdma
    • https://github.com/yufeiren/iperf-rdma
    • China MinSheng Bank
• NUMA-aware cache for iSCSI/iSER
  • Linux SCSI Target Framework (tgtd)
  • https://bitbucket.org/dtyu/tgt
Thank you Q & A