Memcached Design on High Performance RDMA Capable Interconnects
Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman, Nusrat S. Islam,
Xiangyong Ouyang, Hao Wang, Sayantan Sur & D. K. Panda
Network-Based Computing Laboratory, Department of Computer Science and Engineering
The Ohio State University, USA
ICPP - 2011
Outline
• Introduction & Motivation
• Overview of Memcached
• Modern High Performance Interconnects
• Unified Communication Runtime (UCR)
• Memcached Design using UCR
• Performance Evaluation
• Conclusion & Future Work
Introduction
• Tremendous increase in interest in interactive web sites (social networking, e-commerce, etc.)
• Dynamic data is stored in databases for future retrieval and analysis
• Database lookups are expensive
• Memcached: a distributed memory caching layer, implemented using traditional BSD sockets
• The socket interface provides portability, but entails additional processing and multiple message copies
• High-Performance Computing (HPC) has adopted advanced interconnects (e.g., InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE)
  – Low latency, high bandwidth, low CPU overhead
• Used by many machines in the Top500 list (http://www.top500.org)
Memcached Overview
• Memcached provides a scalable, distributed caching layer
• Spare memory in data-center servers can be aggregated to speed up lookups
• Basically a distributed in-memory key-value store
• Keys can be arbitrary character strings, typically MD5 sums or hashes
• Typically used to cache database queries, results of API calls, or webpage rendering elements
• The model is scalable, but typical usage is very network intensive: performance is directly tied to that of the underlying networking technology
[Figure: Typical Memcached deployment — Internet-facing proxy servers (Memcached clients) reach Memcached servers and database servers over a system area network]
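For concreteness, here is a minimal sketch of how a client typically issues Set/Get operations through libmemcached (the client library used in the evaluation later in this talk); the server address, key, and value below are placeholders:

    /* Minimal libmemcached Set/Get sketch (compile with -lmemcached).
       The server address, key, and value are placeholders. */
    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        memcached_return_t rc;
        memcached_st *memc = memcached_create(NULL);
        memcached_server_st *servers =
            memcached_server_list_append(NULL, "127.0.0.1", 11211, &rc);
        memcached_server_push(memc, servers);

        const char *key = "user:42", *val = "cached-row";
        /* Set: store the value under the key (no expiry, no flags) */
        rc = memcached_set(memc, key, strlen(key), val, strlen(val), 0, 0);

        /* Get: fetch it back; a cache miss returns NULL */
        size_t len; uint32_t flags;
        char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        if (out) { printf("%.*s\n", (int)len, out); free(out); }

        memcached_server_list_free(servers);
        memcached_free(memc);
        return 0;
    }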
Modern High Performance Interconnects
[Figure: Application interfaces and protocol stacks —
• Sockets over kernel-space TCP/IP → 1/10 GigE Ethernet adapter and switch
• Sockets over hardware-offloaded TCP/IP → 10 GigE TOE adapter and switch
• Sockets over IPoIB → InfiniBand adapter and switch
• Sockets over SDP (RDMA) → InfiniBand adapter and switch
• IB Verbs with user-space RDMA → InfiniBand adapter and switch]
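To make the verbs/RDMA path concrete: at this level a zero-copy transfer is expressed by posting a work request against registered memory, with no remote CPU involvement. Below is a minimal sketch of posting a one-sided RDMA write; queue-pair setup and the out-of-band exchange of the remote address and rkey are omitted:

    /* Sketch: post a one-sided RDMA write over IB verbs. Assumes qp is an
       already-connected queue pair and mr a registered memory region;
       remote_addr and rkey must have been exchanged out of band. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) local_buf,   /* local source buffer */
            .length = len,
            .lkey   = mr->lkey,                /* local protection key */
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write */
        wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr;         /* target virtual address */
        wr.wr.rdma.rkey        = rkey;                /* remote protection key */
        return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
    }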
Problem Statement
• High-performance RDMA-capable interconnects have emerged in the scientific computing domain
• Applications using Memcached still rely on sockets
• The performance of Memcached is critical to most of its deployments
• Can Memcached be re-designed from the ground up to utilize RDMA-capable networks?
A New Approach using Unified Communication Runtime (UCR)
• Sockets were not designed for high performance
  – Stream semantics often mismatch the needs of upper layers (Memcached, Hadoop)
  – Multiple copies can be involved
[Figure: Current approach — Application → Sockets → 1/10 GigE network. Our approach — Application → UCR → IB Verbs → RDMA-capable networks (IB, 10 GigE, iWARP, RoCE, ...)]
Unified Communication Runtime (UCR)
• Initially proposed to unify the communication runtimes of different parallel programming models
  – J. Jose, M. Luo, S. Sur and D. K. Panda, "Unifying UPC and MPI Runtimes: Experience with MVAPICH" (PGAS '10)
• The design of UCR evolved from the MVAPICH/MVAPICH2 software stacks (http://mvapich.cse.ohio-state.edu/)
• UCR provides interfaces for active messages as well as one-sided put/get operations
• APIs enhanced to support cloud computing applications; several enhancements in UCR:
  – endpoint-based design, revamped active-message API, fault tolerance, and synchronization with timeouts
• Communication is endpoint-based, analogous to sockets
Active Messaging in UCR
• Active messages have proven very powerful in many environments
  – e.g., the GASNet project (UC Berkeley), MPI design using LAPI (IBM)
• We introduce active messages into the data-center domain
• An active message consists of two parts: a header and the data
• When the message arrives at the target, the header handler is run
• The header handler identifies the destination buffer for the data
• The data is put into the destination buffer
• The completion handler is run afterwards (optional)
• Special flags indicate local and remote completions (optional)
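UCR's exact interface is not shown in the talk; the sketch below is purely illustrative, with invented names (item_t, get_header_handler, get_completion_handler, ucr_am_register) standing in for the real API, to make the header-handler/completion-handler flow concrete:

    /* Illustrative only: handler signatures and names are hypothetical,
       invented to mirror the flow described on this slide. */
    #include <stddef.h>

    typedef struct { char buf[4096]; } item_t;
    static item_t table[1024];

    /* Header handler: runs as soon as the header arrives at the target and
       returns the destination buffer; UCR then places the payload there. */
    static void *get_header_handler(const void *header, size_t header_len)
    {
        unsigned slot = *(const unsigned *) header % 1024;  /* e.g. key hash */
        return table[slot].buf;
    }

    /* Completion handler (optional): runs after the data has landed,
       e.g. to mark the item valid and set the completion flag. */
    static void get_completion_handler(void *dest, size_t data_len)
    {
        /* ... set ComplFlag / wake a waiting worker ... */
    }

    /* Registration with a UCR endpoint would associate both handlers with a
       message ID, e.g. (hypothetical):
       ucr_am_register(ep, GET_REPLY_ID,
                       get_header_handler, get_completion_handler); */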
Active Messaging in UCR (contd.)
[Figure: Two message flows. General active message: the origin sends the header; the target runs the header handler, the data is moved by RDMA into the destination buffer, the completion handler runs, and the local/remote completion flags (LComplFlag, RComplFlag, ComplFlag) are set. Optimized short active message: header and data are sent together; the target runs the header handler, copies the data into place, runs the completion handler, and sets the flags, avoiding the separate RDMA data transfer.]
Memcached Design using UCR
• Server and client perform a negotiation protocol
  – The master thread assigns each client to an appropriate worker thread
• Once a client is assigned a verbs worker thread, it communicates with that thread directly and is "bound" to it
• All other Memcached data structures are shared between the RDMA and sockets worker threads
[Figure: Hybrid server architecture — sockets clients and RDMA clients contact the master thread, which hands them off to sockets worker threads or verbs worker threads; all workers operate on shared data (memory slabs, items, ...)]
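The hand-off above is essentially an accept-and-dispatch pattern; here is a minimal pthreads-style sketch of that pattern (a generic illustration, not the actual Memcached/UCR code, with the UCR negotiation elided):

    /* Sketch of the master/worker hand-off: the master accepts clients and
       binds each one to a worker via a notification pipe (generic pattern,
       not the authors' implementation). */
    #include <pthread.h>
    #include <unistd.h>

    #define NUM_WORKERS 4

    typedef struct {
        pthread_t tid;
        int       notify_fd;   /* write end of a pipe the worker polls */
    } worker_t;

    static worker_t workers[NUM_WORKERS];

    /* Master thread: after negotiation, assign the client round-robin.
       From here on, the client communicates with that worker directly. */
    static void dispatch_client(int client_fd)
    {
        static int next = 0;
        worker_t *w = &workers[next++ % NUM_WORKERS];
        (void) write(w->notify_fd, &client_fd, sizeof(client_fd));
    }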
Experimental Setup
• Two clusters were used:
  – Intel Clovertown: each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, and a 250 GB hard disk; networks: 1 GigE, IPoIB, 10 GigE TOE, and IB (DDR)
  – Intel Westmere: each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, and a 160 GB hard disk; networks: 1 GigE, IPoIB, and IB (QDR)
• Memcached software
  – Memcached server: 1.4.5
  – Memcached client: libmemcached 0.45
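The benchmark code itself is not shown in the deck; below is a minimal sketch of how per-operation Get latency could be measured with libmemcached, averaging over many iterations as such microbenchmarks typically do (an assumption about the methodology; memc is already connected and the key preloaded at the desired value size):

    /* Sketch of a Get-latency microbenchmark loop (an assumption about the
       methodology, not the paper's actual benchmark code). */
    #include <libmemcached/memcached.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    double avg_get_latency_us(memcached_st *memc, const char *key, int iters)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < iters; i++) {
            size_t len; uint32_t flags; memcached_return_t rc;
            char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
            free(val);   /* libmemcached returns a malloc'd copy */
        }
        gettimeofday(&t1, NULL);
        return ((t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_usec - t0.tv_usec)) / iters;
    }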
Memcached Get Latency (Small Message)
• Memcached Get latency
  – 4 bytes: DDR: 6 us; QDR: 5 us
  – 4 KB: DDR: 20 us; QDR: 12 us
• Almost a factor of four improvement over 10 GigE (TOE) for 4 KB on the DDR cluster
• Almost a factor of seven improvement over IPoIB for 4 KB on the QDR cluster
[Graphs: Get latency vs. message size (1 B to 4 KB) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR); series: SDP, IPoIB, OSU Design, 1G, 10G-TOE; y-axis: time (us)]
Memcached Get Latency (Large Message)
• Memcached Get latency
  – 8 KB: DDR: 17 us; QDR: 13 us
  – 512 KB: DDR: 362 us; QDR: 94 us
• Almost a factor of three improvement over 10 GigE (TOE) for 512 KB on the DDR cluster
• Almost a factor of four improvement over IPoIB for 512 KB on the QDR cluster
[Graphs: Get latency vs. message size (4 KB to 512 KB) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR); series: SDP, IPoIB, OSU Design, 1G, 10G-TOE; y-axis: time (us)]
Memcached Set Latency (Small Message)
• Memcached Set latency
  – 4 bytes: DDR: 7 us; QDR: 5 us
  – 4 KB: DDR: 15 us; QDR: 13 us
• Almost a factor of four improvement over 10 GigE (TOE) for 4 KB on the DDR cluster
• Almost a factor of six improvement over IPoIB for 4 KB on the QDR cluster
[Graphs: Set latency vs. message size (1 B to 4 KB) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR); series: SDP, IPoIB, OSU Design, 1G, 10G-TOE; y-axis: time (us)]
Memcached Set Latency (Large Message)
• Memcached Set latency
  – 8 KB: DDR: 18 us; QDR: 15 us
  – 512 KB: DDR: 375 us; QDR: 185 us
• Almost a factor of two improvement over 10 GigE (TOE) for 512 KB on the DDR cluster
• Almost a factor of three improvement over IPoIB for 512 KB on the QDR cluster
[Graphs: Set latency vs. message size (4 KB to 512 KB) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR); series: SDP, IPoIB, OSU Design, 1G, 10G-TOE; y-axis: time (us)]
Memcached Latency (10% Set, 90% Get)
• Average transaction latency for the mixed workload
  – 4 bytes: DDR: 7 us; QDR: 5 us
  – 4 KB: DDR: 15 us; QDR: 12 us
• Almost a factor of four improvement over 10 GigE (TOE) for 4 KB on the DDR cluster
[Graphs: average latency vs. message size (1 B to 4 KB) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR); series: SDP, IPoIB, OSU Design, 1G, 10G; y-axis: time (us)]
Memcached Latency (50% Set, 50% Get)
• Average transaction latency for the mixed workload
  – 4 bytes: DDR: 7 us; QDR: 5 us
  – 4 KB: DDR: 15 us; QDR: 12 us
• Almost a factor of four improvement over 10 GigE (TOE) for 4 KB on the DDR cluster
[Graphs: average latency vs. message size (1 B to 4 KB) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR); series: SDP, IPoIB, OSU Design, 1G, 10G-TOE; y-axis: time (us)]
Memcached Get TPS (4 bytes)
• Memcached Get transactions per second for 4-byte messages
  – On IB DDR: about 600K TPS for 16 clients
  – On IB QDR: 1.9M TPS for 16 clients
• Almost a factor of six improvement over 10 GigE (TOE)
• Significant improvement with native IB QDR compared to SDP and IPoIB
[Graphs: Get transactions per second (4-byte messages) for 8 and 16 clients on the Intel Clovertown cluster (IB: DDR; series: SDP, 10G-TOE, OSU Design) and the Intel Westmere cluster (IB: QDR; series: SDP, IPoIB, OSU Design); y-axis: thousands of transactions per second (TPS)]
Memcached Get TPS (4 KB)
• Memcached Get transactions per second for 4 KB messages
  – On IB DDR: about 330K TPS for 16 clients
  – On IB QDR: 842K TPS for 16 clients
• Almost a factor of four improvement over 10 GigE (TOE)
• Significant improvement with native IB QDR
[Graphs: Get transactions per second (4 KB messages) for 8 and 16 clients on the Intel Clovertown cluster (IB: DDR; series: SDP, 10G-TOE, OSU Design) and the Intel Westmere cluster (IB: QDR; series: SDP, IPoIB, OSU Design); y-axis: transactions per second (TPS)]
Conclusion & Future Work
• Described a novel design of Memcached for RDMA-capable networks
• Provided a detailed performance comparison of our design against unmodified Memcached using sockets over RDMA-capable and 10 GigE networks
• Observed significant performance improvement with the proposed design
  – Factor of four improvement in Memcached Get latency (4 KB)
  – Factor of six improvement in Memcached Get transactions/s (4 bytes)
• We plan to improve UCR by exploiting more features of the OpenFabrics API (e.g., the Unreliable Datagram transport), designing iWARP and RoCE versions of UCR, and thereby scaling Memcached further
• We are also working on enhancing the Hadoop/HBase designs for RDMA-capable networks
Thank You!
{jose, subramon, luom, zhanjmin, huangjia, rahmanmd, islamn, ouyangx, wangh, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/