
Memcached Design on High Performance RDMA Capable Interconnects

Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan Sur & D. K. Panda

Network-Based Computing Laboratory, Department of Computer Science and Engineering

The Ohio State University, USA

ICPP - 2011

Outline

•  Introduction

•  Overview of Memcached

•  Modern High Performance Interconnects

•  Unified Communication Runtime (UCR)

•  Memcached Design using UCR

•  Performance Evaluation

•  Conclusion & Future Work


Introduction

•  Tremendous increase in interest in interactive web-sites (social networking, e-commerce etc.)

•  Dynamic data is stored in databases for future retrieval and analysis

•  Database lookups are expensive

•  Memcached – a distributed memory caching layer, implemented using traditional BSD sockets

•  Socket interface provides portability, but entails additional processing and multiple message copies

•  High-Performance Computing (HPC) has adopted advanced interconnects (e.g. InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE)
–  Low latency, High Bandwidth, Low CPU overhead

•  Many machines in Top500 list (http://www.top500.org)


Memcached Overview

•  Memcached provides scalable distributed caching

•  Spare memory in data-center servers can be aggregated to speed up lookups

•  Basically a key-value distributed memory store

•  Keys can be any character strings, typically MD5 sums or hashes

•  Typically used to cache database queries, results of API calls or webpage rendering elements

•  Scalable model, but typical usage is very network intensive: performance is directly related to that of the underlying networking technology

[Figure: deployment tiers: Internet, Proxy Servers (Memcached Clients), Memcached Servers, Database Servers; the tiers are linked by System Area Networks]
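The caching pattern described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not Memcached itself: the `TinyCache` class, the `slow_db_lookup` function and the modulo placement of keys on servers are hypothetical stand-ins, and in-process dictionaries take the place of remote Memcached servers.

```python
import hashlib


class TinyCache:
    """In-process stand-in for a set of Memcached servers (hypothetical)."""

    def __init__(self, num_servers):
        # One dict per pretend server; real servers would be remote processes.
        self.servers = [dict() for _ in range(num_servers)]

    def _server_for(self, key):
        # Pick a server from the key's MD5 digest (simple modulo placement,
        # chosen for illustration only).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def get(self, key):
        return self._server_for(key).get(key)

    def set(self, key, value):
        self._server_for(key)[key] = value


db_calls = 0


def slow_db_lookup(query):
    """Pretend database query; counts how often it is really executed."""
    global db_calls
    db_calls += 1
    return "rows for: " + query


def cached_query(cache, query):
    # The key is the MD5 sum of the query text, one of the typical key
    # choices named on the slide.
    key = hashlib.md5(query.encode("utf-8")).hexdigest()
    value = cache.get(key)
    if value is None:  # cache miss: ask the database, then populate the cache
        value = slow_db_lookup(query)
        cache.set(key, value)
    return value


cache = TinyCache(num_servers=4)
first = cached_query(cache, "SELECT name FROM users WHERE id = 7")
second = cached_query(cache, "SELECT name FROM users WHERE id = 7")
```

The second call is served from the cache, so the pretend database is queried only once. In a real deployment every `get` and `set` crosses the network, which is why the interconnect matters so much for this workload.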


Modern High Performance Interconnects

[Figure: protocol stacks between the Application and the network. Application Interface: Sockets, IB Verbs. Protocol Implementation: Kernel Space TCP/IP with Ethernet Driver, TCP/IP with Hardware Offload, IPoIB, SDP, RDMA in User space. Network Adapter: 1/10 GigE Adapter, 10 GigE Adapter, InfiniBand Adapter. Network Switch: Ethernet Switch, 10 GigE Switch, InfiniBand Switch. Paths shown: 1/10 GigE, 10 GigE-TOE, IPoIB, SDP, IB Verbs]

Problem Statement

•  High-performance RDMA capable interconnects have emerged in the scientific computation domain

•  Applications using Memcached are still relying on sockets

•  Performance of Memcached is critical to most of its deployments

•  Can Memcached be re-designed from the ground up to utilize RDMA capable networks?


A New Approach using Unified Communication Runtime (UCR)

[Figure: Current Approach stack: Application over Sockets over 1/10 GigE Network]

•  Sockets not designed for high-performance
–  Stream semantics often mismatch for upper layers (Memcached, Hadoop)
–  Multiple copies can be involved

[Figure: Our Approach stack: Application over UCR over IB Verbs over RDMA Capable Networks (IB, 10GE, iWARP, RoCE ...)]
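The stream-semantics mismatch can be illustrated with a small sketch. A stream socket delivers bytes, not messages, so a message-oriented upper layer has to add its own framing and loop over partial reads. The helper names (`send_msg`, `recv_exact`, `recv_msg`), the 4-byte length prefix and the payload strings are assumptions made for this example; they are not the Memcached wire protocol.

```python
import socket
import struct


def send_msg(sock, payload):
    # Prefix each message with its length as a 4-byte big-endian integer.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    # A stream recv() may return fewer bytes than requested, so keep reading.
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    # Read the length prefix first, then exactly that many payload bytes.
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# Two connected stream sockets inside one process.
a, b = socket.socketpair()
send_msg(a, b"request one")
send_msg(a, b"request two")
first = recv_msg(b)
second = recv_msg(b)
a.close()
b.close()
```

Without the length prefix, the two writes could arrive at the receiver as a single run of bytes with no message boundary. Each layer of buffering and reassembly like this adds processing and copies, which is the overhead a message-oriented runtime aims to avoid.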


Unified Communication Runtime (UCR)

•  Initially proposed to unify communication runtimes of different parallel programming models
–  J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, (PGAS’10)

•  Design of UCR evolved from MVAPICH/MVAPICH2 software stacks (http://mvapich.cse.ohio-state.edu/)

•  UCR provides interfaces for Active Messages as well as one-sided put/get operations

•  Enhanced APIs to support Cloud computing applications

•  Several enhancements in UCR
–  end-point based design, revamped active-message API, fault tolerance and synchronization with timeouts.

•  Communications based on endpoint, analogous to sockets

Active Messaging in UCR

•  Active messages are proven to be very powerful in many environments

•  GASNet Project (UC Berkeley), MPI design using LAPI (IBM), etc.

•  We introduce Active messages into the data-center domain

•  An Active Message consists of two parts – header and data

•  When the message arrives at the target, header handler is run

•  Header handler identifies the destination buffer for the data

•  Data is put into the destination buffer

•  Completion handler is run afterwards (optional)

•  Special flags to indicate local & remote completions (optional)
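The sequence of steps listed above can be sketched as a single-process simulation. Everything here is hypothetical: the `Target` class, `send_active_message` and the flag names are invented for illustration and are not the UCR API, and ordinary function calls stand in for network transfers. The exact points at which the flags are set are also an assumption.

```python
class Target:
    """Receiving side of a simulated active message (hypothetical, not UCR)."""

    def __init__(self):
        self.buffers = {}    # destination buffers, keyed by a message tag
        self.completed = []  # tags whose completion handler has run

    def header_handler(self, header):
        # Identify (here: allocate) the destination buffer for the data.
        buf = bytearray(header["length"])
        self.buffers[header["tag"]] = buf
        return buf

    def completion_handler(self, header):
        # Optional step that runs once the data is in place.
        self.completed.append(header["tag"])


def send_active_message(target, header, data, run_completion=True):
    flags = {"local_complete": False, "remote_complete": False}
    dest = target.header_handler(header)  # 1. header handler runs at the target
    dest[:] = data                        # 2. data is put into the buffer
    flags["local_complete"] = True        #    origin may now reuse its buffer
    if run_completion:
        target.completion_handler(header) # 3. optional completion handler
    flags["remote_complete"] = True       # 4. data is in place at the target
    return flags


target = Target()
payload = b"hello"
flags = send_active_message(
    target, {"tag": "req-1", "length": len(payload)}, payload
)
```

The point of the pattern is that the header alone tells the target where the payload should land, so the data can be placed directly into its final buffer instead of being staged and copied.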


Active Messaging in UCR (contd.)

[Figure, two panels, each a timeline between Origin and Target. General Active Message Functionality: Header, HeaderHandler, RDMA Data, CompletionHandler, with LComplFlag, RComplFlag and ComplFlag being set. Optimized Short Active Message Functionality: Header + Data, HeaderHandler, Copy Data, CompletionHandler, with LComplFlag, RComplFlag and ComplFlag being set]


Memcached Design using UCR

•  Server and client perform a negotiation protocol
–  Master thread assigns clients to appropriate worker thread

•  Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

•  All other Memcached data structures are shared among RDMA and Sockets worker threads

[Figure: Sockets Client and RDMA Client contact the Master Thread; requests are served by Sockets Worker Threads and Verbs Worker Threads; all worker threads use the Shared Data (Memory, Slabs, Items, …)]
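The threading model on this slide can be sketched with ordinary Python threads. This is a rough illustration under several assumptions: the `Worker` and `Master` classes are hypothetical, queues stand in for both sockets and verbs transports, round-robin is just one possible way to pick an "appropriate" worker, and a locked dictionary replaces Memcached's slabs and item tables.

```python
import itertools
import queue
import threading

shared_items = {}               # shared store used by every worker thread
shared_lock = threading.Lock()  # protects shared_items


class Worker(threading.Thread):
    def __init__(self, name, transport):
        super().__init__(name=name, daemon=True)
        self.transport = transport     # "sockets" or "verbs"
        self.requests = queue.Queue()  # stands in for the real transport

    def run(self):
        while True:
            request = self.requests.get()
            if request is None:        # shutdown signal
                break
            op, key, value, reply = request
            with shared_lock:
                if op == "set":
                    shared_items[key] = value
                    reply.put("STORED")
                else:
                    reply.put(shared_items.get(key))


class Master:
    def __init__(self, workers):
        self.bindings = {}             # client id -> worker it is bound to
        self.pools = {}                # transport -> round-robin iterator
        for transport in {w.transport for w in workers}:
            pool = [w for w in workers if w.transport == transport]
            self.pools[transport] = itertools.cycle(pool)

    def negotiate(self, client_id, transport):
        # Assign the client to a worker of the matching type and bind it.
        worker = next(self.pools[transport])
        self.bindings[client_id] = worker
        return worker


workers = [
    Worker("sock-0", "sockets"),
    Worker("verbs-0", "verbs"),
    Worker("verbs-1", "verbs"),
]
for w in workers:
    w.start()
master = Master(workers)

rdma_worker = master.negotiate("rdma-client", "verbs")
sock_worker = master.negotiate("sockets-client", "sockets")

reply = queue.Queue()
rdma_worker.requests.put(("set", "k", "v", reply))
set_status = reply.get(timeout=5)
sock_worker.requests.put(("get", "k", None, reply))
get_value = reply.get(timeout=5)

for w in workers:
    w.requests.put(None)
```

A value stored through the verbs worker is visible to the sockets worker because both operate on the same shared store, which mirrors the slide's point that only the communication path differs between the two kinds of clients.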


Experimental Setup

•  Used Two Clusters
–  Intel Clovertown
   •  Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs, 6 GB main memory, 250 GB hard disk
   •  Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
–  Intel Westmere
   •  Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs, 12 GB main memory, 160 GB hard disk
   •  Network: 1GigE, IPoIB, and IB (QDR)

•  Memcached Software
–  Memcached Server: 1.4.5
–  Memcached Client: (libmemcached) 0.45

Memcached Get Latency (Small Message)

•  Memcached Get latency
–  4 bytes – DDR: 6 us; QDR: 5 us
–  4K bytes – DDR: 20 us; QDR: 12 us

•  Almost factor of four improvement over 10GE (TOE) for 4KB on the DDR cluster

•  Almost factor of seven improvement over IPoIB for 4KB on the QDR cluster

[Figure: Memcached Get latency for small messages, Time (us) vs. Message Size (1 byte to 4K). Panels: Intel Clovertown Cluster (IB: DDR); Intel Westmere Cluster (IB: QDR). Legend: SDP, IPoIB, OSU Design, 1G, 10G-TOE]

Memcached Get Latency (Large Message)

•  Memcached Get latency
–  8K bytes – DDR: 17 us; QDR: 13 us
–  512K bytes – DDR: 362 us; QDR: 94 us

•  Almost factor of three improvement over 10GE (TOE) for 512KB on the DDR cluster

•  Almost factor of four improvement over IPoIB for 512K bytes on the QDR cluster
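As a small worked calculation, the Get latencies quoted on the small-message and large-message slides can be turned into DDR/QDR ratios. The numbers are copied from the slides; the dictionary layout and variable names are made up for this sketch. The two figures come from different clusters (Clovertown with DDR, Westmere with QDR), so a ratio reflects the whole platform and not the link speed alone.

```python
# Get latencies in microseconds, as quoted on the slides.
get_latency_us = {
    "4 bytes": {"DDR": 6, "QDR": 5},
    "4K bytes": {"DDR": 20, "QDR": 12},
    "8K bytes": {"DDR": 17, "QDR": 13},
    "512K bytes": {"DDR": 362, "QDR": 94},
}

# Ratio above 1 means the QDR cluster is faster for that message size.
ratios = {
    size: values["DDR"] / values["QDR"]
    for size, values in get_latency_us.items()
}

for size, ratio in ratios.items():
    print(f"{size:>10}: DDR/QDR latency ratio = {ratio:.2f}")
```

The gap between the two clusters is small for tiny messages and grows with message size, which is consistent with bandwidth mattering more than per-message overhead for large transfers.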


[Figure: Memcached Get latency for large messages, Time (us) vs. Message Size (4K to 512K), one panel per cluster. Legend: SDP, IPoIB, OSU Design, 1G, 10G-TOE]

Memcached Set Latency (Small Message)

•  Memcached Set latency
–  4 bytes – DDR: 7 us; QDR: 5 us
–  4K bytes – DDR: 15 us; QDR: 13 us

•  Almost factor of four improvement over 10GE (TOE) for 4KB on the DDR cluster

•  Almost factor of six improvement over IPoIB for 4KB on the QDR cluster


[Figure: Memcached Set latency for small messages, Time (us) vs. Message Size (1 byte to 4K), one panel per cluster. Legend: SDP, IPoIB, OSU Design, 1G, 10G-TOE]

Memcached Set Latency (Large Message)

•  Memcached Set latency
–  8K bytes – DDR: 18 us; QDR: 15 us
–  512K bytes – DDR: 375 us; QDR: 185 us

•  Almost factor of two improvement over 10GE (TOE) for 512KB on the DDR cluster

•  Almost factor of three improvement over IPoIB for 512KB on the QDR cluster


[Figure: Memcached Set latency for large messages, Time (us) vs. Message Size (4K to 512K), one panel per cluster. Legend: SDP, IPoIB, OSU Design, 1G, 10G-TOE]

Memcached Latency (10% Set, 90% Get)



•  Almost factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster


[Figure: Memcached latency with 10% Set and 90% Get, Time (us) vs. Message Size (1 byte to 4K), one panel per cluster. Legend: SDP, IPoIB, OSU Design, 1G, 10G]

Memcached Latency (50% Set, 50% Get)



•  Almost factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster


[Figure: Memcached latency with 50% Set and 50% Get, Time (us) vs. Message Size (1 byte to 4K), one panel per cluster. Legend: SDP, IPoIB, OSU Design, 1G, 10G-TOE]

Memcached Get TPS (4 bytes)

•  Memcached Get transactions per second for 4 bytes
–  On IB DDR about 600K/s for 16 clients
–  On IB QDR 1.9M/s for 16 clients

•  Almost factor of six improvement over 10GE (TOE)

•  Significant improvement with native IB QDR compared to SDP and IPoIB


[Figure: Memcached Get throughput for 4-byte messages, in Thousands of Transactions per second (TPS), for 8 Clients and 16 Clients. One panel compares SDP, IPoIB and OSU Design; the other compares SDP, 10G-TOE and OSU Design]

Memcached Get TPS (4KB)

•  Memcached Get transactions per second for 4K bytes
–  On IB DDR about 330K/s for 16 clients
–  On IB QDR 842K/s for 16 clients

•  Almost factor of four improvement over 10GE (TOE)

•  Significant improvement with native IB QDR
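The aggregate rates quoted on the two TPS slides can be related to per-client timing and payload bandwidth with a little arithmetic. The TPS figures are taken from the slides; everything else is an assumption made for this sketch: load is taken to be spread evenly over the 16 clients, each client is assumed to have one request outstanding at a time, "4K" is read as 4096 bytes, and only payload bytes are counted.

```python
CLIENTS = 16

# Aggregate Get transactions per second at 16 clients, as quoted on the slides.
# Keys are (InfiniBand generation, payload size in bytes).
get_tps = {
    ("DDR", 4): 600_000,      # "about 600K/s"
    ("QDR", 4): 1_900_000,
    ("DDR", 4096): 330_000,   # "about 330K/s"; 4K read as 4096 bytes
    ("QDR", 4096): 842_000,
}


def per_client_time_us(tps, clients=CLIENTS):
    # Mean time per transaction seen by one client, assuming an even spread
    # of load and a single outstanding request per client.
    return clients / tps * 1e6


def payload_gbit_per_s(tps, size_bytes):
    # Payload bytes moved per second, expressed in gigabits per second.
    return tps * size_bytes * 8 / 1e9


summary = {
    key: (per_client_time_us(tps), payload_gbit_per_s(tps, key[1]))
    for key, tps in get_tps.items()
}

for (fabric, size), (time_us, gbit) in summary.items():
    print(f"{fabric} {size:>4} B: {time_us:6.2f} us per op per client, "
          f"{gbit:6.2f} Gbit/s payload")
```

Under these assumptions the 4-byte results are bounded by per-operation cost, while the 4K results move enough payload that link bandwidth becomes the more relevant limit.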


[Figure: Memcached Get throughput for 4K-byte messages, in Transactions per second (TPS). One panel compares SDP, IPoIB and OSU Design; the other compares SDP, 10G-TOE and OSU Design]


Conclusion & Future Work

•  Described a novel design of Memcached for RDMA capable networks

•  Compared our design in detail against unmodified Memcached using sockets over RDMA and 10GE networks

•  Observed significant performance improvement with the proposed design

•  Factor of four improvement in Memcached get latency (4K bytes)

•  Factor of six improvement in Memcached get transactions/s (4 bytes)

•  We plan to improve UCR by taking into account the many features in the OpenFabrics API and the Unreliable Datagram transport, and by designing iWARP and RoCE versions of UCR, thereby scaling Memcached

•  We are working on enhancing the Hadoop/HBase designs for RDMA capable networks

Thank You!

{jose, subramon, luom, zhanjmin, huangjia, rahmanmd, islamn, ouyangx, wangh, surs, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/
