CephFS Jewel MDS Performance Benchmark


Transcript of CephFS Jewel MDS Performance Benchmark

Page 1: Cephfs jewel mds performance benchmark

CephFS Jewel MDS Performance

Xiaoxi CHEN([email protected]) Zhiteng HUANG([email protected])

Page 2: Cephfs jewel mds performance benchmark

Who are we and why CephFS?
• We are from the eBay private cloud team.

Page 3: Cephfs jewel mds performance benchmark


Questions to be answered
• What's the maximum throughput of CephFS?
  • Break it into metadata and data.
  • The data path is shared with rbd/rgw usage, so it should be fine.
  • The metadata path (i.e. the MDS) is unique to CephFS.
• How many metadata OP/s can a single MDS handle?
  • As the active-active MDS cluster is not stable yet, single-MDS performance limits the performance of a single filesystem.
• Can the MDS scale up or scale out?
  • Scale up: can it utilize multiple CPU cores in a single node?
  • Scale out: MDS active-active? Multi-FS?
• What's the best way to use CephFS, kernel client or FUSE?

Page 4: Cephfs jewel mds performance benchmark


Hardware Environment
• 24 OSD nodes, each with 20 NR-SAS drives, self-journaling; 480 OSDs in total.
• 1 MDS node with 24 cores and 256GB RAM.
• 3 VMs as clients, each with 8 cores and 24GB RAM.
• All components are connected at 10Gb line speed.

Page 5: Cephfs jewel mds performance benchmark


Hardware Environment

OSD Nodes
  CPU          E5-2640-V3 (2 processors, 16 cores / 32 threads in total)
  RAM          128GB
  Storage HBA  PMC 8885
  SAS HDD      24x 6TB NR-SAS
  NIC          Dual 10GbE, bonding mode 6

MDS Node
  CPU          E5-2680-V3 (2 processors, 24 cores / 48 threads in total)
  RAM          256GB
  NIC          10GbE

Page 6: Cephfs jewel mds performance benchmark


Software stack

                Ceph OSD       Ceph MDS       Client
  Ceph version  Jewel          Jewel          Jewel
  OS            Ubuntu 14.04   Ubuntu 14.04   Ubuntu 14.04
  Kernel        3.13           3.13           4.2.0-38

  Pool name   PG_num   Size   Type
  data        8192     3      Replicated
  metadata    32768    3      Replicated

Page 7: Cephfs jewel mds performance benchmark


Test 1: ceph-fuse vs kernel client
• Methodology
  • time for i in `seq 10000`; do echo hello > file${i}; done
  • Create 4096 files with O_DSYNC, each file 1KB in size.

              Fuse     Kernel (4.4)
  Time (s)    12.461   3.912

              Fuse     Kernel (4.4)
  Time (s)    8.13     1.62

• The kernel client is significantly faster than FUSE, so later tests focus on the kernel client.

Page 8: Cephfs jewel mds performance benchmark


Test 2: File creation
• In this test, we created 300 million small files (1KB each) in the cluster, to see:
  • How does creation speed vary with the number of files created?
  • Will Ceph crash after a certain number of files have been created?
• The test was done with a very simple single-threaded tester written in C; the basic flow (sketched below) is:
  • open the file with O_SYNC|O_CREAT,
  • write 1KB of data into it,
  • close the file.
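The original tester was not published with the slides; a minimal sketch of the described flow, with a hypothetical mount path and a reduced file count, might look like this:

    /* create_files.c -- sketch of the single-threaded creation tester */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char path[256];
        char buf[1024];                          /* 1KB payload per file */
        memset(buf, 'x', sizeof(buf));

        for (long i = 0; i < 1000000L; i++) {    /* 100M per client in the real run */
            snprintf(path, sizeof(path), "/fs_mount/client0/file%ld", i);
            int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
            if (fd < 0) { perror("open"); exit(1); }
            if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) { perror("write"); exit(1); }
            close(fd);
        }
        return 0;
    }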

Page 9: Cephfs jewel mds performance benchmark


Test 2.1: Create 300M files into a single dir --- MDS hang

• The creation progress goes well, but...
  • Any "ls" or "open" in the dir makes the MDS hang.
  • This is because the metadata of this dir grows larger than max_message_size (2GB), causing an integer overflow.
• Bug report
  • http://tracker.ceph.com/issues/16010
• Fixes:
  • https://github.com/ceph/ceph/pull/9395 fixes the integer overflow but keeps uint32, which only bumps the limit to 4GB. Not backported to Jewel.
  • https://github.com/ceph/ceph/pull/9789 limits the number of files per dir to 100K unless dir_frag is enabled (directory fragmentation is claimed not stable yet, off by default); being backported to Jewel (hopefully 10.2.3).
• Conclusion
  • In the short term, the next Jewel release is needed to limit the number of files per dir and prevent the overflow.
  • Directory fragmentation may be the long-term solution.

Page 10: Cephfs jewel mds performance benchmark


Test 2.2: Create 300M files, 4096 per dir

• In this test, 3 clients mount the FS via the kernel client and each creates a dedicated working dir.
• Each client tries to create 100 million 1KB files, creating a new subdir for every 4096 files.
  • The dir tree looks like /fs_mount/client[0-2]/[0-25K)_dir/[0-4K)_file (see the path-construction sketch below).
• Creation speed (files per second) vs. number of files created, from the benchmark figure:
  • X axis: number of files already created. Y axis: file creation TPS.
  • At the very beginning of the test, creation speed can reach ~6000 files/s; this is probably what Ben had seen in his benchmark.
  • Performance fluctuates between 1000 and 2000 TPS once stable (> 20M files already created).
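A short sketch of how the per-client tree described above can be laid out; the directory and file names used here are assumptions for illustration, not the original tool:

    /* layout.c -- map file index i to sub-dir i/4096 and file i%4096 */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int main(void)
    {
        char dir[256], path[256];
        for (long i = 0; i < 8192; i++) {        /* 100M per client in the real run */
            long d = i / 4096, f = i % 4096;
            snprintf(dir, sizeof(dir), "/fs_mount/client0/%ld_dir", d);
            if (f == 0)
                mkdir(dir, 0755);                /* new sub-dir every 4096 files */
            snprintf(path, sizeof(path), "%s/%ld_file", dir, f);
            /* create the 1KB file at `path` as in the Test 2 flow */
        }
        return 0;
    }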

Page 11: Cephfs jewel mds performance benchmark


Test 2.2: Something bad
• Clients failing to respond to cache pressure, potentially causing the MDS to be killed by OOM.
• Ceph health and the MDS log complain:
  • mds0: Behind on trimming (7581/30)
  • mds0: Client cephfs-c01-553133 failing to respond to cache pressure
  • mds0: Client cephfs-c03-553275 failing to respond to cache pressure
  • mds0: Client cephfs-c02-553274 failing to respond to cache pressure
• From the perf dump of the MDS we can see that inode_max is set to 100K, while roughly 4 million inodes are actually cached:
  • "inode_max": 100000,
  • "inodes": 4195342,
  • "inodes_top": 0,
  • "inodes_bottom": 0,
  • "inodes_pin_tail": 4195342,
  • "inodes_pinned": 4195342,
  • "inodes_expired": 10007,
  • "inodes_with_caps": 4195333,
• Impact
  • This is supposed to be a client-side bug.
  • The MDS can be killed due to OOM; a misbehaving client can take down the MDS cluster by holding and never releasing inodes.
  • Unmounting the client releases the held inodes.
• There are several similar reports in the ceph-users ML.
  • Zheng Yan said this bug is likely fixed by commit 5e804ac482 "ceph: don't invalidate page cache when inode is no longer used", which is in the 4.4 kernel, but that seems not to be the case, as people using 4.4 or 4.6 also hit the issue.

Page 12: Cephfs jewel mds performance benchmark


Test 2.3: Open File Benchmark
• Methodology
  • Clients mount CephFS via the kernel client.
  • Drop the client-side inode cache, dentries and page cache before each test.
  • Restart the MDS for non-caching mode.
  • Randomly pick a file from the first 4096 dirs (which contain 16 million files), open it, then close it immediately; repeat for 1 million random files.
• mds_cache_size (how many inodes can stay in the MDS memory cache)
  • scaled from 100K to 100 million (16 million actually occupied).
• 2 modes are tested (a sketch of the tester follows below):
  • Non-caching: restart the MDS before testing, which drops the MDS inode cache; initialize the random generator seed with time(NULL).
  • Caching: initialize the random generator seed with a constant number, which ensures the footprint is exactly the same between runs.
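The open/close tester was not published either; a minimal sketch of the described flow, assuming the hypothetical layout from Test 2.2 and a seed argument to switch between caching (fixed seed) and non-caching (time-based seed) modes:

    /* open_bench.c -- random open/close over the 16M-file working set */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define NDIRS  4096            /* first 4096 dirs = ~16M files */
    #define NFILES 4096
    #define NOPS   1000000L

    int main(int argc, char **argv)
    {
        /* caching mode: pass a constant seed; non-caching mode: no argument */
        unsigned seed = (argc > 1) ? (unsigned)atoi(argv[1]) : (unsigned)time(NULL);
        srand(seed);

        char path[256];
        for (long i = 0; i < NOPS; i++) {
            int d = rand() % NDIRS;
            int f = rand() % NFILES;
            snprintf(path, sizeof(path), "/fs_mount/client0/%d_dir/%d_file", d, f);
            int fd = open(path, O_RDONLY);
            if (fd >= 0)
                close(fd);
        }
        return 0;
    }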

Page 13: Cephfs jewel mds performance benchmark


Test 2.3: Open File Benchmark

  MDS cache size   RSS (MB)   MB / (K inodes)   Time to finish (non-cached)   TPS    Time to finish (cached)   TPS
  100K             678        6.78              30 minutes +                  n/a    30 minutes +              n/a
  1 million        4232       4.23              30 minutes +                  n/a    30 minutes +              n/a
  10 million       32514      3.25              30 minutes +                  n/a    30 minutes +              n/a
  16.804 million   65184      3.88              928s                          1077   216s                      4545

• When the cache size < working set size, the test cannot finish within 30 minutes.
  • Our workload is purely random, so the target dir is usually not in cache; opening a file then needs to load the dentries of the whole dir (4096 inodes).
  • The MDS log shows lots of cache evictions.
• When the cache size > working set, the maximum TPS measured is 4545.
• In all cases the MDS hits the CPU bound, since the MDS is single-threaded (top shows ~105% CPU usage).
• Note that mds_cache_size is NOT A HARD LIMIT, see Test 2.2 for details.

Page 14: Cephfs jewel mds performance benchmark


Test 2.4: File meta update test
• Methodology
  • Clients mount CephFS via the kernel client.
  • Drop the client-side inode cache, dentries and page cache before each test.
  • Randomly pick a file from the first 4096 dirs (which contain 16 million files) and use utime() to change the access time; repeat for 1 million random files (see the sketch below).
  • mds_cache_size is pinned to 100 million.
• 2 modes are tested:
  • Non-caching: restart the MDS before testing, which drops the MDS inode cache; initialize the random generator seed with time(NULL).
  • Caching: initialize the random generator seed with a constant number, which ensures the footprint is exactly the same between runs.
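This test differs from the open benchmark only in the operation performed; a sketch under the same assumptions (hypothetical paths, optional constant seed for caching mode):

    /* utime_bench.c -- random atime/mtime updates via utime() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <utime.h>

    int main(int argc, char **argv)
    {
        srand((argc > 1) ? (unsigned)atoi(argv[1]) : (unsigned)time(NULL));

        char path[256];
        for (long i = 0; i < 1000000L; i++) {
            int d = rand() % 4096;
            int f = rand() % 4096;
            snprintf(path, sizeof(path), "/fs_mount/client0/%d_dir/%d_file", d, f);
            utime(path, NULL);     /* NULL sets atime/mtime to the current time */
        }
        return 0;
    }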

Page 15: Cephfs jewel mds performance benchmark


Test 2.4: File meta update test

                      Time to finish   TPS
  Non-caching mode    955s             1047
  Caching mode        243s             4115

• Performance is (very) similar to the open file test, indicating that utime/mtime updates are fairly lightweight.
  • The MDS just logs the update and applies it in memory, with a lazy flush.
• CPU bound again (saturates a single core).

Page 16: Cephfs jewel mds performance benchmark


Test 2.5: File Rename Test
• Methodology
  • Clients mount CephFS via the kernel client.
  • Drop the client-side inode cache, dentries and page cache before each test.
  • Iterate from dir X to dir (X+4096) and rename all files (4096 files per dir) from $filename to $filename + "_renamed" (see the sketch below).
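A minimal sketch of the sequential rename walk, assuming the same hypothetical directory layout as in the earlier tests:

    /* rename_bench.c -- rename every file in 4096 dirs to "<name>_renamed" */
    #include <stdio.h>

    int main(void)
    {
        char oldp[256], newp[256];
        for (int d = 0; d < 4096; d++) {
            for (int f = 0; f < 4096; f++) {
                snprintf(oldp, sizeof(oldp), "/fs_mount/client0/%d_dir/%d_file", d, f);
                snprintf(newp, sizeof(newp), "%s_renamed", oldp);
                if (rename(oldp, newp) != 0)
                    perror("rename");
            }
        }
        return 0;
    }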

  Test                            TPS
  Non-caching mode                1431
  Caching mode                    1434
  Caching mode + skip MDS log     2048

• By setting mds_log = false, the MDS skips logging the operation (and also skips flushing the log to omap).
• Caching and non-caching show similar performance because this test touches the files sequentially.

Page 17: Cephfs jewel mds performance benchmark


Methodologies summary
• Dropping the page cache is not enough; restarting the MDS daemon is the way to clear the MDS cache (a client-side cache-drop helper is sketched below).
• Try to have enough background data sitting on the FS.
  • CephFS relies heavily on omap, and omap is based on a key-value database; the write amplification of a k-v DB is significantly related to the amount of data.
• Use mds_log = false to skip the write IO generated by the MDS, which is helpful for estimating upper-bound performance.
• Long-run and repeat the tests.
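For reference, a small helper (assuming root privileges) for the client-side cache drop used before each test; restarting the MDS itself still has to be done separately:

    /* drop_caches.c -- free pagecache, dentries and inodes on the client */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (!f) { perror("fopen"); return 1; }
        fputs("3\n", f);     /* 3 = pagecache + dentries + inodes */
        fclose(f);
        return 0;
    }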

Page 18: Cephfs jewel mds performance benchmark


Take away
• MDS operation is single-threaded, which is the primary limiter for performance (Tests 2.3~2.5).
  • Can we break the mds_lock into finer-grained locks?
• The MDS cache size has a significant impact on performance as well as memory usage (Test 2.3).
  • But a misbehaving client can cause the MDS to exceed the limit (Test 2.2) and potentially cause OOM.
• Putting the metadata pool on SSD will not bring significant benefit (Test 2.5).
• The Ceph kernel client has better performance than FUSE (Test 1).
  • The kernel client lacks quota support.
  • You need to upgrade to a very new kernel to get bug fixes and features.
• Never put too many files in a single dir (Test 2.1).

Page 19: Cephfs jewel mds performance benchmark


Executive summary
• Create/Open/Utime/Rename were tested.
• Most cases are CPU-constrained because the MDS core is single-threaded.
• Several bugs were found and reported; the community is actively working on fixes.
• The MDS needs a scale-out solution, as a single MDS is limited by its single thread.
  • Active-active MDS: seems it will not be ready within one release cycle?
  • Multi-FS: much simpler than MDS active-active and basically workable; the gaps are:
    • Kernel client support (https://github.com/torvalds/linux/commit/235a09821c) is in 4.7 RC and still needs time to land in distros.
    • Some bugs in multi-fs support are being backported and should be in the next Jewel release.
    • The community is working on putting each FS's metadata in a different namespace rather than in separate pools, but none of the fsck tools are prepared to see multiple inodes with the same number distinguished only by namespace.