
IMCa: A High-Performance Caching Front-end for GlusterFS on InfiniBand

Ranjit Noronha and Dhabaleswar K. Panda
Network-Based Computing Lab
The Ohio State University
{noronha, panda}@cse.ohio-state.edu

Outline of the Talk

• Background and Motivation

• Architecture and Design of IMCa

• Experimental Evaluation of IMCa

• Conclusions and Future Work

Background

• Large Scale Scientific and Commercial Workloads

• Petascale Computers have arrived

• High-performance access to I/O data is crucial – parallel application performance is often limited by I/O

• Clustered/Parallel File Systems have evolved to meet this challenge

• File System performance still dependent on disk performance

• Single Server Bandwidth Drop With Multiple Clients

• Parallel I/O Bandwidth From Multiple Servers

[Figure: bandwidth (MegaBytes/s) vs. number of clients (1 to 8) for RDMA, IPoIB and GigE, with 4 GB and 8 GB of server memory.]

• Performance for Small Files
  – Generally difficult to achieve
  – Many environments with a large number of small files
  – Storing on the same disk block provides limited benefit
  – Striping does not provide benefit
  – Store on different servers

• Cache Coherency Problems
  – Client-side cache provides good performance
  – Non-coherent client cache limited when there is sharing
  – Limited scalability of coherent caches

• Server Load Problems
  – RDMA reduces overhead from TCP/IP
  – RDMA-based transport protocols cannot reduce copying costs within the file system

Problem Statement

• Which file-system operations are potential targets for caching?
• What are the alternatives to the traditional client cache/server cache architecture?
• What are the advantages and disadvantages of alternate cache architectures?
• How do we provide the performance of the non-coherent client cache without the scalability problems of the coherent client cache?

Outline of the Talk

• Background and Motivation

• Architecture and Design of IMCa

• Experimental Evaluation of IMCa

• Conclusions and Future Work

Potential File System Operations That May Be Cached

• Potential Targets For Caching
  – Should be something the client reads
  – Should be possible to uniquely identify the cache target
  – Should be possible to chunk the data element

• Small Operations: Stat, Create, Delete, Open

• Stat
  – Read by the client
  – Used as a form of update by many applications
  – Should be cached
  – Should be updated on read/write operations on the server (see the sketch below)

• Create/Delete
  – Not read by the client
  – Delete should invalidate previous cache entries

• File Open
  – Not a target for caching, but may be used for prefetching

• Data Transfer Operations (Reads and Writes)
  – Blocks needed
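A minimal read-through sketch of the stat cache in C: serve the stat from the cache when present, otherwise ask the file system and populate the cache. The one-entry cache_get/cache_put stand-ins and the "stat:<path>" key format are illustrative assumptions; in IMCa the entry lives in the MCD (memcached) array and is refreshed by the server on read/write operations.

    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    /* One-entry stand-in cache: just enough to make the sketch compile.
     * In IMCa the entry would live in the memcached (MCD) array instead. */
    static char last_key[4096];
    static struct stat last_val;
    static int have_entry = 0;

    static int cache_get(const char *key, struct stat *st)
    {
        if (!have_entry || strcmp(key, last_key) != 0)
            return -1;                          /* miss */
        *st = last_val;
        return 0;                               /* hit */
    }

    static void cache_put(const char *key, const struct stat *st)
    {
        snprintf(last_key, sizeof(last_key), "%s", key);
        last_val = *st;
        have_entry = 1;
    }

    /* Read-through stat: serve from the cache when possible, otherwise
     * ask the file system and populate the cache for later readers.
     * A write on the server would refresh or drop this entry. */
    static int cached_stat(const char *path, struct stat *st)
    {
        char key[4200];
        snprintf(key, sizeof(key), "stat:%s", path);

        if (cache_get(key, st) == 0)
            return 0;                           /* hit: server not contacted */
        if (stat(path, st) != 0)
            return -1;
        cache_put(key, st);
        return 0;
    }

    int main(void)
    {
        struct stat st;
        cached_stat("/etc/hostname", &st);      /* first call misses */
        cached_stat("/etc/hostname", &st);      /* second call hits */
        printf("size: %lld bytes\n", (long long)st.st_size);
        return 0;
    }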

Intermediate Cache Architecture (IMCa)

• Easy to maintain coherency
• Extensible
• Can multiple cache nodes provide benefit?

[Architecture diagram: the FS client carries a CMCache layer and the server an SMCache layer above the underlying FS cache and ext3. Between them sits an array of cache nodes, Cache1 through Cachen, each cache being a node in the MCD array; a CRC32 hash function is used on both sides to find the cache server holding an item (sketched below).]
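The slides only say that a CRC32 hash locates the cache server for an item. Below is a minimal sketch of one plausible mapping, hashing a per-block key and taking it modulo the number of MCDs; the "path:block" key format is an assumption for illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* Plain bitwise CRC32 (polynomial 0xEDB88320), no lookup table. */
    static uint32_t crc32_of(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= p[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    /* Map one (file, block) pair to one of n_mcd cache servers. */
    static int pick_mcd(const char *path, unsigned long block, int n_mcd)
    {
        char key[4200];
        int len = snprintf(key, sizeof(key), "%s:%lu", path, block);
        return (int)(crc32_of(key, (size_t)len) % (unsigned)n_mcd);
    }

    int main(void)
    {
        /* Example: block 3 of a file, with 4 cache servers. */
        printf("MCD index: %d\n", pick_mcd("/gluster/data/file.dat", 3, 4));
        return 0;
    }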

Need for Blocks In IMCa

• Most file systems store data on disk as blocks

• Parallel file-systems stripe data across multiple servers

• IMCa uses a fixed block size to store data across the cache servers
  – Block size should provide good performance for most small files
  – Should avoid excessive fragmentation (see the block-splitting sketch below)

[Diagram: a requested byte range shown against the data block boundaries; the file data is segmented by the IMCa block size, so some extra data outside the request falls inside the blocks transferred.]
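A small sketch of the block arithmetic, assuming the 2 KB block size used later in the evaluation: the requested byte range is expanded to block boundaries, so the cache may move more data than the application asked for.

    #include <stdio.h>

    #define IMCA_BLOCK_SIZE 2048   /* 2 KB, the size used in the experiments */

    /* Split a byte-range request (length > 0) into fixed-size cache blocks
     * and report how much extra data the block boundaries force us to move. */
    static void split_request(unsigned long offset, unsigned long length)
    {
        unsigned long first = offset / IMCA_BLOCK_SIZE;
        unsigned long last  = (offset + length - 1) / IMCA_BLOCK_SIZE;
        unsigned long nblk  = last - first + 1;
        unsigned long moved = nblk * IMCA_BLOCK_SIZE;

        printf("request [%lu, %lu): blocks %lu..%lu, %lu blocks, %lu extra bytes\n",
               offset, offset + length, first, last, nblk, moved - length);
    }

    int main(void)
    {
        split_request(1000, 3000);   /* crosses two block boundaries */
        return 0;
    }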

Design-Read Operations (Hit)

[Diagram: the client (CMCache) locates the block in one of the cache nodes Cache1 through Cachen and is served from there; the server-side SMCache, underlying FS cache and ext3 are not contacted.]

Design-Read Operations (Miss)

[Diagram: the block is not found in the cache nodes, so the request continues to the server (SMCache, underlying FS cache, ext3) and the returned data can then be placed in the cache.]
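A compact sketch of the read path covering both the hit and the miss case. The mcd_get/mcd_put/server_read_block functions are stand-ins for illustration; the real path goes through memcached on the cache nodes and the GlusterFS client/server stack. On a hit the file server is never contacted, which is where the latency and server-load savings come from.

    #include <stdio.h>
    #include <string.h>

    #define BLKSZ 2048

    /* One-slot stand-in for the MCD array, just for the demo. */
    static char cached_key[4096];
    static char cached_blk[BLKSZ];
    static int  cached = 0;

    static int mcd_get(const char *key, char *blk)
    {
        if (!cached || strcmp(key, cached_key) != 0)
            return -1;
        memcpy(blk, cached_blk, BLKSZ);
        return 0;
    }

    static void mcd_put(const char *key, const char *blk)
    {
        snprintf(cached_key, sizeof(cached_key), "%s", key);
        memcpy(cached_blk, blk, BLKSZ);
        cached = 1;
    }

    /* Stand-in for a block read through the GlusterFS server. */
    static void server_read_block(const char *path, unsigned long blkno, char *blk)
    {
        (void)path; (void)blkno;
        memset(blk, 'x', BLKSZ);            /* pretend the server returned data */
    }

    /* Read one block: hit -> served from the MCD array, server untouched;
     * miss -> fetch from the server and populate the cache on the way back. */
    static void read_block(const char *path, unsigned long blkno, char *blk)
    {
        char key[4200];
        snprintf(key, sizeof(key), "%s:%lu", path, blkno);

        if (mcd_get(key, blk) == 0) {
            printf("hit  %s\n", key);
            return;
        }
        server_read_block(path, blkno, blk);
        mcd_put(key, blk);
        printf("miss %s\n", key);
    }

    int main(void)
    {
        char blk[BLKSZ];
        read_block("/gluster/data/file.dat", 0, blk);   /* cold miss */
        read_block("/gluster/data/file.dat", 0, blk);   /* hit */
        return 0;
    }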

Design-Write Operations

[Diagram: the write travels from the client (CMCache) to the server (SMCache, underlying FS cache, ext3), with the cache nodes Cache1 through Cachen on the path.]
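A corresponding sketch of the write path. The data always reaches the server; what happens to the cached blocks is not spelled out in the slides, so invalidate-on-write is shown here as one plausible policy (updating the blocks in place would be the alternative). server_write and mcd_delete are stand-ins.

    #include <stdio.h>

    #define BLKSZ 2048

    /* Stand-ins for the GlusterFS server write and the memcached delete. */
    static void server_write(const char *path, unsigned long off,
                             const char *buf, unsigned long len)
    { (void)path; (void)off; (void)buf; (void)len; }

    static void mcd_delete(const char *key) { printf("drop %s\n", key); }

    /* Write path sketch: the data always reaches the server; the cache
     * blocks the write touches are dropped so later reads refetch them.
     * (Invalidate-on-write is an assumption, not the documented policy.) */
    static void write_range(const char *path, unsigned long off,
                            const char *buf, unsigned long len)
    {
        server_write(path, off, buf, len);

        for (unsigned long b = off / BLKSZ; b <= (off + len - 1) / BLKSZ; b++) {
            char key[4200];
            snprintf(key, sizeof(key), "%s:%lu", path, b);
            mcd_delete(key);
        }
    }

    int main(void)
    {
        char buf[3000] = {0};
        write_range("/gluster/data/file.dat", 1000, buf, sizeof(buf));
        return 0;
    }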

Advantages/Disadvantages of IMCa

• Fewer requests hit the server
• Latency for requests read from the cache is lower
• MCDs are self-managing
• Failures in MCDs do not impact correctness
• Additional node elements needed especially for caching
• Cold misses are expensive
• Additional blocks/data transfers needed
• Overhead and delayed updates

Outline of the Talk

• Background and Motivation

• Architecture and Design of IMCa

• Experimental Evaluation of IMCa

• Conclusions and Future Work

GlusterFS File System

• Clustered File System

• Client and Server in userspace

• Uses the FUSE interface to translate FS calls from the kernel to the user-space daemons

• No striping; data is distributed across the servers

• Possible to apply translators at the server and client to perform different functions

• WWW.glusterfs.org

Experimental Setup

• 64-node cluster
  – 8-core Intel Clovertown
  – 8 GB memory
• InfiniBand DDR is the interconnect
• GlusterFS file-system
• The data servers each have 8 HighPoint RAID disks
• Communication protocol is IPoIB in Reliable Connected (RC) mode
• MCDs run on independent nodes and use up to 6 GB of memory
• CMCache and SMCache use a CRC32 hash function for locating data on the MCDs
• Lustre 1.6.4.3 is used with socklnd for comparison

Experiment-stat

– Consists of two stages
– First stage (untimed)
  • 262144 files created by a single node
– Second stage (timed)
  • Each node performs a stat on each of the 262144 files sequentially (a single-node sketch follows)
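A single-node sketch of this benchmark in C. The mount point /mnt/glusterfs/statdir is hypothetical, and the multi-node coordination (barriers between the phases, collecting the per-node times) is omitted.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/time.h>

    #define NFILES 262144     /* number of files in the benchmark */

    int main(void)
    {
        char path[256];
        struct stat st;
        struct timeval t0, t1;

        /* Phase 1 (untimed): one node creates all the files. */
        for (int i = 0; i < NFILES; i++) {
            snprintf(path, sizeof(path), "/mnt/glusterfs/statdir/f%d", i);
            int fd = open(path, O_CREAT | O_WRONLY, 0644);
            if (fd >= 0) close(fd);
        }

        /* Phase 2 (timed): every node stats each file sequentially. */
        gettimeofday(&t0, NULL);
        for (int i = 0; i < NFILES; i++) {
            snprintf(path, sizeof(path), "/mnt/glusterfs/statdir/f%d", i);
            stat(path, &st);
        }
        gettimeofday(&t1, NULL);

        printf("stat phase: %.2f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
        return 0;
    }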

Stat performance

• Time to stat 262144 different files
• Benchmark has two phases: create (untimed), followed by stat (timed)
• 82% improvement at 64 nodes

[Figure: time (seconds) to complete the stat phase vs. number of nodes for No Cache, 1, 2, 4 and 6 Cache Servers, and Lustre-4DS; one panel for 16, 32 and 64 nodes, one for 1, 2, 4 and 8 nodes.]

Experiment-Write Single Client

– One Client

– Writes 1,024 records of size r sequentially to the file

– Measure the time for this to complete (sketched below)
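A single-client sketch of the write benchmark (hypothetical mount point, no multi-node harness). The read component described on a later slide rewinds the same file and reads the 1,024 records back.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define NRECORDS 1024

    int main(int argc, char **argv)
    {
        size_t r = (argc > 1) ? (size_t)atol(argv[1]) : 2048;  /* record size */
        char *buf = malloc(r);
        if (!buf) return 1;
        memset(buf, 'a', r);

        int fd = open("/mnt/glusterfs/wtest.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < NRECORDS; i++)          /* sequential writes */
            if (write(fd, buf, r) != (ssize_t)r) break;
        gettimeofday(&t1, NULL);
        close(fd);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("record size %zu B: %.1f us/write\n", r, us / NRECORDS);
        free(buf);
        return 0;
    }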

Write – Single Client

[Figure: write latency (microseconds) vs. I/O record size (1 byte to 32768 bytes) for No Cache, IMCa (2K) and IMCa (Server Threads).]

• 2 KB block size
• Server thread helps performance

Experiments-Read

• Single Client Read
  – Follows the Write component of the benchmark
  – Move the file pointer to the beginning of the file
  – Read 1,024 records of size r sequentially from the file
  – Measure the time for this to complete

• Multiple Client Read
  – Each client uses a separate file

• Multiple Client Read Shared
  – Same file used by every client

• Lustre configurations
  – Cold Client Cache: unmount between Write and Read
  – Warm Client Cache: no unmount between Write and Read

Read Latency (Single Client)

[Figure: single-client read latency (us) vs. record size (1 byte to 65536 bytes) for No Cache, Cache (256), Cache (2K), Cache (8K), Lustre-1DS (Cold), Lustre-4DS (Cold) and Lustre-4DS (Warm).]

• Lustre shows the best latency
• Cache provides benefit for small message sizes

Read Multiple Client (32 clients)

[Figure: read latency (microseconds) with 32 clients vs. record size (1 byte to 65536 bytes) for NoCache, IMCa (1), IMCa (2), IMCa (4), Lustre (Cold) and Lustre (Warm).]

• 51% improvement in latency at 16K
• Multiple MCDs help reduce capacity misses

Iozone throughput

[Figure: IOzone throughput (MegaBytes/second) with 1, 2, 4 and 8 threads for No Cache, Cache (1), Cache (2), Cache (4) and Lustre-1DS (Cold).]

• 1, 2, 4, 8 IOzone threads, 1 GB files, 2 KB block size
• 325 MB/s (NoCache) -> 868 MB/s (4 MCDs)

Read-Shared Latency

[Figure: read-shared time (microseconds) vs. number of nodes (2 to 32) for No Cache, Lustre-1DS (Cold) and MCD (1).]

• IMCa improves performance over the NoCache case

Outline of the Talk

• Background and Motivation

• Architecture and Design of IMCa

• Experimental Evaluation of IMCa

• Conclusions and Future Work

Conclusions and Future Work

• Proposed, designed and evaluated an intermediate cache for GlusterFS
• Good improvement in stat performance
• Improvement in latency/throughput of read operations
  – Depends on block size
• Would like to evaluate the performance with RDMA
• Would like to evaluate distribution algorithms

Acknowledgements

Our research is supported by the following organizations

Thank you

{noronha, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory

http://nowlab.cse.ohio-state.edu/