Bluestore


Transcript of Bluestore

Page 1: Bluestore

BLUESTORE: A NEW, FASTER STORAGE BACKEND FOR CEPH

SAGE WEIL

VAULT – 2016.04.21

As told by

Allen Samuels

Engineering Fellow, SanDisk

[email protected]

Ceph Day, Portland 2016.5.25

Page 2: Bluestore

OUTLINE

Ceph background and context
FileStore, and why POSIX failed us
NewStore – a hybrid approach
BlueStore – a new Ceph OSD backend
  Metadata
  Data
Performance
Upcoming changes
Summary
Update since Vault (4.21.2016)

Page 3: Bluestore

MOTIVATION

Page 4: Bluestore

OBJECTSTORE AND DATA MODEL

ObjectStore
  abstract interface for storing local data
  implementations: EBOFS, FileStore

EBOFS
  a user-space extent-based object file system
  deprecated in favor of FileStore on btrfs in 2009

Object
  data (file-like byte stream)
  attributes (small key/value)
  omap (unbounded key/value)

Collection
  placement group shard (slice of the RADOS pool)
  sharded by 32-bit hash value

All writes are transactions
  Atomic + Consistent + Durable
  Isolation provided by OSD

Page 5: Bluestore

FILESTORE

FileStore
  PG = collection = directory
  object = file

Leveldb
  large xattr spillover
  object omap (key/value) data

Originally just for development...
later, the only supported backend (on XFS)

/var/lib/ceph/osd/ceph-123/
  current/
    meta/
      osdmap123
      osdmap124
    0.1_head/
      object1
      object12
    0.7_head/
      object3
      object5
    0.a_head/
      object4
      object6
    db/
      <leveldb files>

Page 6: Bluestore

POSIX FAILS: TRANSACTIONS

OSD carefully manages consistency of its data
All writes are transactions
  we need A+C+D; OSD provides I
Most are simple
  write some bytes to object (file)
  update object attribute (file xattr)
  append to update log (leveldb insert)
...but others are arbitrarily large/complex

[
  { "op_name": "write",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "length": 4194304,
    "offset": 0,
    "bufferlist length": 4194304 },
  { "op_name": "setattrs",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "attr_lens": { "_": 269, "snapset": 31 } },
  { "op_name": "omap_setkeys",
    "collection": "0.6_head",
    "oid": "#0:60000000::::head#",
    "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 } }
]

Page 7: Bluestore

POSIX FAILS: TRANSACTIONS

Btrfs transaction hooks

  /* trans start and trans end are dangerous, and only for
   * use by applications that know how to avoid the
   * resulting deadlocks */
  #define BTRFS_IOC_TRANS_START  _IO(BTRFS_IOCTL_MAGIC, 6)
  #define BTRFS_IOC_TRANS_END    _IO(BTRFS_IOCTL_MAGIC, 7)

Writeback ordering
  #define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)

What if we hit an error? ceph-osd process dies?
  #define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)
  There is no rollback...

Page 8: Bluestore

POSIX FAILS: TRANSACTIONS

Write-ahead journal
  serialize and journal every ObjectStore::Transaction
  then write it to the file system

Btrfs parallel journaling
  periodic sync takes a snapshot
  on restart, rollback, and replay journal against appropriate snapshot

XFS/ext4 write-ahead journaling
  periodic sync, then trim old journal entries
  on restart, replay entire journal
  lots of ugly hackery to deal with events that aren't idempotent
    e.g., renames, collection delete + create, …

full data journal → we double write everything → ~halve disk throughput

Page 9: Bluestore

POSIX FAILS: ENUMERATION

Ceph objects are distributed by a 32-bit hash
Enumeration is in hash order
  scrubbing
  "backfill" (data rebalancing, recovery)
  enumeration via librados client API

POSIX readdir is not well-ordered
Need O(1) "split" for a given shard/range

Build directory tree by hash-value prefix
  split any directory when size > ~100 files
  merge when size < ~50 files
  read entire directory, sort in-memory

…
DIR_A/
DIR_A/A03224D3_qwer
DIR_A/A247233E_zxcv
…
DIR_B/
DIR_B/DIR_8/
DIR_B/DIR_8/B823032D_foo
DIR_B/DIR_8/B8474342_bar
DIR_B/DIR_9/
DIR_B/DIR_9/B924273B_baz
DIR_B/DIR_A/
DIR_B/DIR_A/BA4328D2_asdf
…

Page 10: Bluestore

NEWSTORE

Page 11: Bluestore

NEWSTORE GOALS

More natural transaction atomicity
Avoid double writes
Efficient object enumeration
Efficient clone operation
Efficient splice ("move these bytes from object X to object Y")
Efficient IO pattern for HDDs, SSDs, NVMe
Minimal locking, maximum parallelism (between PGs)
Advanced features
  full data and metadata checksums
  compression

Page 12: Bluestore

NEWSTORE – WE MANAGE NAMESPACE

POSIX has the wrong metadata model for us
Ordered key/value is a perfect match
  well-defined object name sort order
  efficient enumeration and random lookup

NewStore = rocksdb + object files

/var/lib/ceph/osd/ceph-123/
  db/
    <rocksdb, leveldb, whatever>
  blobs.1/
    0
    1
    ...
  blobs.2/
    100000
    100001
    ...

[Diagram: three NewStore OSDs, each with RocksDB, deployed across HDD and SSD devices]

Page 13: Bluestore

NEWSTORE FAIL: CONSISTENCY OVERHEAD

RocksDB has a write-ahead log ("journal")
XFS/ext4(/btrfs) have their own journal (tree-log)
Journal-on-journal has high overhead
  each journal manages half of overall consistency, but incurs the same overhead

write(2) + fsync(2) to new blobs.2/103021
  1 write + flush to block device
  1 write + flush to XFS/ext4 journal
write(2) + fsync(2) on RocksDB log
  1 write + flush to block device
  1 write + flush to XFS/ext4 journal

Page 14: Bluestore

NEWSTORE FAIL: ATOMICITY NEEDS WAL

We can't overwrite a POSIX file as part of an atomic transaction
Writing overwrite data to a new file means many files for each object

Write-ahead logging?
  put overwrite data in "WAL" records in RocksDB
  commit atomically with transaction
  then overwrite original file data
  but we're back to a double-write of overwrites...

Performance sucks again
  Overwrites dominate RBD block workloads

Page 15: Bluestore

BLUESTORE

Page 16: Bluestore

BLUESTORE

BlueStore = Block + NewStore
  consume raw block device(s)
  key/value database (RocksDB) for metadata
  data written directly to block device
  pluggable block Allocator

We must share the block device with RocksDB
  implement our own rocksdb::Env
  implement tiny "file system" BlueFS
  make BlueStore and BlueFS share the device

[Diagram: ObjectStore → BlueStore; RocksDB runs on BlueRocksEnv over BlueFS for metadata,
 the Allocator manages space, and both data and metadata go to the raw BlockDevice(s)]

Page 17: Bluestore

ROCKSDB: BLUEROCKSENV + BLUEFS

class BlueRocksEnv : public rocksdb::EnvWrapper
  passes file IO operations to BlueFS

BlueFS is a super-simple "file system"
  all metadata loaded in RAM on start/mount
  no need to store block free list
  coarse allocation unit (1 MB blocks)
  all metadata is written to a journal
  journal rewritten/compacted when it gets large

[Diagram: BlueFS on-disk layout – superblock, journal extents, data extents;
 the journal holds records like "file 10, file 11, file 12, file 12, file 13, rm file 12, file 13, ..."]

Map “directories” to different block devices

  db.wal/  – on NVRAM, NVMe, SSD
  db/      – level0 and hot SSTs on SSD
  db.slow/ – cold SSTs on HDD

BlueStore periodically balances free space
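
To make the EnvWrapper idea concrete, here is a minimal, hedged sketch of intercepting RocksDB's file IO through a custom rocksdb::Env so it can be routed to a tiny backend like BlueFS. MiniFS and its open_for_write method are hypothetical stand-ins, not Ceph's BlueFS API, and the real BlueRocksEnv implements many more Env operations. RocksDB is pointed at such a wrapper via rocksdb::Options::env.

// Illustrative sketch only; MiniFS is a hypothetical stand-in for BlueFS.
#include <memory>
#include <string>
#include <rocksdb/env.h>

struct MiniFS {
  // pretend to open a file for write in the backend's own namespace
  void open_for_write(const std::string& name) { (void)name; }
};

class MiniFSEnv : public rocksdb::EnvWrapper {
 public:
  explicit MiniFSEnv(MiniFS* fs)
      : rocksdb::EnvWrapper(rocksdb::Env::Default()), fs_(fs) {}

  // RocksDB asks its Env for every file it creates (WAL, SSTs, MANIFEST, ...).
  // A real BlueRocksEnv returns file objects backed by BlueFS; here we just
  // notify the backend and fall through to the wrapped default Env.
  rocksdb::Status NewWritableFile(const std::string& fname,
                                  std::unique_ptr<rocksdb::WritableFile>* result,
                                  const rocksdb::EnvOptions& options) override {
    fs_->open_for_write(fname);
    return rocksdb::EnvWrapper::NewWritableFile(fname, result, options);
  }

 private:
  MiniFS* fs_;
};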

Page 18: Bluestore

ROCKSDB: JOURNAL RECYCLING

Problem: 1 small (4 KB) Ceph write → 3-4 disk IOs!
  BlueStore: write 4 KB of user data
  rocksdb: append record to WAL
    write update block at end of log file
    fsync: XFS/ext4/BlueFS journals inode size/alloc update to its journal

fallocate(2) doesn't help
  data blocks are not pre-zeroed; fsync still has to update alloc metadata

rocksdb LogReader only understands two modes
  read until end of file (need accurate file size)
  read all valid records, then ignore zeros at end (need zeroed tail)

Page 19: Bluestore

ROCKSDB: JOURNAL RECYCLING (2)

Put old log files on recycle list (instead of deleting them)

LogWriter
  overwrite old log data with new log data
  include log number in each record

LogReader
  stop replaying when we get garbage (bad CRC)
  or when we get a valid CRC but the record is from a previous log incarnation

Now we get one log append → one IO!

Upstream in RocksDB! but missing a bug fix (PR #881)

Works with normal file-based storage, or BlueFS
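
The replay rule can be sketched roughly as below. This is a hedged illustration of the idea, not RocksDB's actual log reader; the Record layout and the crc_ok helper are assumptions.

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct Record {
  uint32_t crc;
  uint64_t log_number;   // which WAL incarnation wrote this record
  std::string payload;
};

// placeholder check; a real reader computes the CRC over the payload
bool crc_ok(const Record& r) { return r.crc != 0; }

void replay(const std::vector<Record>& records, uint64_t current_log_number,
            const std::function<void(const std::string&)>& apply) {
  for (const auto& r : records) {
    if (!crc_ok(r))
      return;                  // garbage: torn/unwritten tail, normal end of replay
    if (r.log_number != current_log_number)
      return;                  // valid CRC, but left over from the recycled file's
                               // previous incarnation, also end of replay
    apply(r.payload);          // still-live record: apply it
  }
}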

Page 20: Bluestore

METADATA

Page 21: Bluestore

BLUESTORE METADATA

Partition namespace for different metadata
  S* – "superblock" metadata for the entire store
  B* – block allocation metadata
  C* – collection name → cnode_t
  O* – object name → onode_t or enode_t
  L* – write-ahead log entries, promises of future IO
  M* – omap (user key/value data, stored in objects)
  V* – overlay object data (obsolete?)

Page 22: Bluestore

ONODE

Per-object metadata
  Lives directly in key/value pair
  Serializes to ~200 bytes

  Unique object id (like ino_t)
  Size in bytes
  Inline attributes (user attr data)
  Block pointers (user byte data)
  Overlay metadata
  Omap prefix/ID (user k/v data)

struct bluestore_onode_t {
  uint64_t nid;
  uint64_t size;
  map<string,bufferptr> attrs;
  map<uint64_t,bluestore_extent_t> block_map;
  map<uint64_t,bluestore_overlay_t> overlay_map;
  uint64_t omap_head;
};

struct bluestore_extent_t {
  uint64_t offset;
  uint32_t length;
  uint32_t flags;
};

Page 23: Bluestore

CNODE

Collection metadata
  Interval of object namespace

  key layout: shard, pool, hash, name → bits
    C<NOSHARD,12,3d321e00> "12.e123d3" = <25>

  key layout: shard, pool, hash, name, snap, gen
    O<NOSHARD,12,3d321d88,foo,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d321d92,bar,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d321e02,baz,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d321e12,zip,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d321e12,dee,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d321e38,dah,NOSNAP,NOGEN> = …

struct spg_t {
  uint64_t pool;
  uint32_t hash;
  shard_id_t shard;
};

struct bluestore_cnode_t {
  uint32_t bits;
};

Nice properties
  Ordered enumeration of objects
  We can "split" collections by adjusting metadata

Page 24: Bluestore

ENODE

Extent metadata
  Sometimes we share blocks between objects (usually clones/snaps)
  We need to reference count those extents
  We still want to split collections and repartition extent metadata by hash

struct bluestore_enode_t {
  struct record_t {
    uint32_t length;
    uint32_t refs;
  };
  map<uint64_t,record_t> ref_map;

  void add(uint64_t offset, uint32_t len, unsigned ref=2);
  void get(uint64_t offset, uint32_t len);
  void put(uint64_t offset, uint32_t len, vector<bluestore_extent_t> *release);
};

Page 25: Bluestore

DATA PATH

Page 26: Bluestore

DATA PATH BASICS

Terms
  Sequencer
    An independent, totally ordered queue of transactions
    One per PG
  TransContext
    State describing an executing transaction

Two ways to write (see the sketch below)
  New allocation
    Any write larger than min_alloc_size goes to a new, unused extent on disk
    Once that IO completes, we commit the transaction
  WAL (write-ahead-logged)
    Commit temporary promise to (over)write data with transaction
      includes data!
    Do async overwrite
    Clean up temporary k/v pair
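
A minimal sketch of that decision, assuming the write path can tell whether a range overwrites already-referenced data; the helper name and signature are illustrative, not BlueStore's actual write logic.

#include <cstdint>

enum class WritePath { NEW_ALLOCATION, WAL };

// min_alloc_size matches the option named above; overwrite detection is
// reduced to a boolean that the caller supplies.
WritePath choose_write_path(uint64_t length, bool overwrites_existing_data,
                            uint64_t min_alloc_size) {
  // Large writes to previously unused space go straight to a fresh extent;
  // the k/v transaction commits only after that data IO completes.
  if (length >= min_alloc_size && !overwrites_existing_data)
    return WritePath::NEW_ALLOCATION;
  // Everything else becomes a WAL record (data included) that commits
  // atomically with the transaction; the overwrite is applied asynchronously
  // and the temporary k/v pair is cleaned up afterwards.
  return WritePath::WAL;
}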

Page 27: Bluestore

TRANSCONTEXT STATE MACHINE

States: PREPARE → AIO_WAIT → KV_QUEUED → KV_COMMITTING → FINISH,
with WAL writes continuing KV_COMMITTING → WAL_QUEUED → WAL_AIO_WAIT → WAL_CLEANUP → FINISH

[Diagram: TransContexts queued per Sequencer; AIO is initiated in PREPARE and again for WAL
 overwrites, and a TransContext waits for earlier TransContexts in its Sequencer to be ready
 and for the next KV commit batch before moving on]
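
The states from the diagram, written out as a hedged sketch; the real BlueStore code may differ in names and transitions.

enum class TransContextState {
  PREPARE,        // build the k/v transaction, initiate any new-allocation AIO
  AIO_WAIT,       // wait for data AIO to reach the block device
  KV_QUEUED,      // queued behind earlier TransContexts in the Sequencer
  KV_COMMITTING,  // batched commit to RocksDB in progress
  WAL_QUEUED,     // WAL overwrite work queued (WAL writes only)
  WAL_AIO_WAIT,   // async overwrite IO in flight
  WAL_CLEANUP,    // delete the temporary WAL k/v record
  FINISH
};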

Page 28: Bluestore

AIO, O_DIRECT, AND CACHING

By default, all IO is AIO + O_DIRECT
  but sometimes we want to cache (e.g., POSIX_FADV_WILLNEED)

BlueStore mixes direct and buffered IO
  O_DIRECT read(2) and write(2) invalidate and/or flush pages... racily
  we avoid mixing them on the same pages
  disable readahead: posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM)

But it's not quite enough...
  moving to fully user-space cache

From the open(2) man page:
  "Applications should avoid mixing O_DIRECT and normal I/O to the same file,
   and especially to overlapping byte regions in the same file."
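
A small sketch of the open pattern this implies (direct IO plus disabling kernel readahead on the descriptor); error handling is omitted and the helper name is illustrative.

#define _GNU_SOURCE                              // for O_DIRECT on Linux
#include <fcntl.h>

int open_block_device(const char* path) {
  int fd = open(path, O_RDWR | O_DIRECT);        // AIO + O_DIRECT by default
  if (fd >= 0)
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);  // disable kernel readahead
  return fd;
}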

Page 29: Bluestore

BLOCK FREE LIST

FreelistManager
  persist list of free extents to key/value store
  prepare incremental updates for allocate or release

Initial implementation
  <offset> = <length>
  keep in-memory copy
  enforce an ordering on commits
    del 1600=100000
    put 1700=99990
  small initial memory footprint, very expensive when fragmented

New approach
  <offset> = <region bitmap>, where region is N blocks (1024?)
  no in-memory state
  use k/v merge operator to XOR allocation or release
    merge 10=0000000011
    merge 20=1110000000
  RocksDB log-structured-merge tree coalesces keys during compaction
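
The XOR merge idea can be sketched with RocksDB's AssociativeMergeOperator, as below. This is illustrative rather than BlueStore's actual FreelistManager code; such an operator is installed via rocksdb::Options::merge_operator and updates are issued with DB::Merge().

#include <string>
#include <rocksdb/merge_operator.h>
#include <rocksdb/slice.h>

// Each key covers a fixed region of blocks; allocate/release is recorded as an
// XOR operand, so compaction can coalesce operands instead of rewriting values.
class XorMergeOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& key, const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value, std::string* new_value,
             rocksdb::Logger* logger) const override {
    // start from the existing region bitmap, or all zeroes if there is none
    *new_value = existing_value ? existing_value->ToString()
                                : std::string(value.size(), '\0');
    // flip the bits named by the operand: allocation and release are the same
    // operation, which is what makes XOR attractive here
    for (size_t i = 0; i < value.size() && i < new_value->size(); ++i)
      (*new_value)[i] ^= value[i];
    return true;
  }
  const char* Name() const override { return "XorMergeOperator"; }
};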

Page 30: Bluestore

BLOCK ALLOCATOR

Allocator
  abstract interface to allocate new space

StupidAllocator
  bin free extents by size (powers of 2)
  choose sufficiently large extent closest to hint
  highly variable memory usage
    btree of free extents
  implemented, works
  based on ancient ebofs policy

BitmapAllocator
  hierarchy of indexes
    L1: 2 bits = 2^6 blocks
    L2: 2 bits = 2^12 blocks
    ...
    00 = all free, 11 = all used, 01 = mix
  fixed memory consumption
    ~35 MB RAM per TB
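
A hedged sketch of the power-of-two binning the StupidAllocator description implies; the real allocator also honors the hint and tracks extents more cleverly, so this only shows bin selection and first-fit behavior. All names here are illustrative.

#include <cstdint>
#include <map>

struct Extent { uint64_t offset, length; };

class BinnedFreeList {
 public:
  // return an extent to the free list, binned by floor(log2(length))
  void release(Extent e) { bins_[bin_for(e.length)].emplace(e.offset, e.length); }

  // walk bins from the smallest one that could satisfy the request upward and
  // take the first sufficiently large extent found
  bool allocate(uint64_t want, Extent* out) {
    for (int b = bin_for(want); b < kBins; ++b) {
      for (auto it = bins_[b].begin(); it != bins_[b].end(); ++it) {
        if (it->second >= want) {
          *out = {it->first, want};
          Extent tail{it->first + want, it->second - want};
          bins_[b].erase(it);
          if (tail.length) release(tail);   // return any unused remainder
          return true;
        }
      }
    }
    return false;
  }

 private:
  static constexpr int kBins = 48;
  static int bin_for(uint64_t len) {        // floor(log2(len)), clamped
    int b = 0;
    while ((len >>= 1) && b < kBins - 1) ++b;
    return b;
  }
  std::map<uint64_t, uint64_t> bins_[kBins];   // offset -> length, per size bin
};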

Page 31: Bluestore

SMR HDD

Let's support them natively!
  256 MB zones / bands
    must be written sequentially, but not all at once
  libzbc supports ZAC and ZBC HDDs
    host-managed or host-aware

SMRAllocator
  write pointer per zone
  used + free counters per zone
  Bonus: almost no memory!

IO ordering
  must ensure allocated writes reach disk in order

Cleaning
  store k/v hints: zone → object hash
  pick emptiest closed zone, scan hints, move objects that are still there
  opportunistically rewrite objects we read if the zone is flagged for cleaning soon

Page 32: Bluestore

FANCY STUFF

Page 33: Bluestore

WE WANT FANCY STUFF

Full data checksums
  We scrub... periodically
  We want to validate checksums on every read

Compression
  3x replication is expensive
  Any scale-out cluster is expensive

Page 34: Bluestore

WE WANT FANCY STUFF

Full data checksums
  We scrub... periodically
  We want to validate checksums on every read
  More data with each extent pointer
    4 KB for a 32-bit csum per 4 KB block
    bigger onode: 300 → 4396 bytes
    larger csum blocks?

Compression
  3x replication is expensive
  Any scale-out cluster is expensive
  Need largish extents to get compression benefit (64 KB, 128 KB)
    overwrites need to do read/modify/write

Page 35: Bluestore

INDIRECT EXTENT STRUCTURES

map<uint64_t,bluestore_lextent_t> extent_map;     // lives in the onode_t

struct bluestore_lextent_t {
  uint64_t blob_id;
  uint64_t b_off, b_len;
};

map<uint64_t,bluestore_blob_t> blob_map;          // lives in the onode_t or enode_t

struct bluestore_blob_t {
  vector<bluestore_pextent_t> extents;
  uint32_t logical_length;     ///< uncompressed length
  uint32_t flags;              ///< FLAGS_*
  uint16_t num_refs;
  uint8_t csum_type;           ///< CSUM_*
  uint8_t csum_block_order;
  vector<char> csum_data;
};

struct bluestore_pextent_t {
  uint64_t offset, length;     // data on block device
};

An onode may reference a piece of a blob, or multiple pieces
The csum block size can vary
Blobs are ref counted (when in the enode)
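
To see how the pieces fit together, here is a hedged sketch of resolving a logical object offset to a disk address using the structures above; it assumes the offset is covered by an lextent and ignores checksum and compression handling.

#include <cstdint>
#include <map>

// assumes the struct definitions shown above are in scope
uint64_t resolve(const std::map<uint64_t, bluestore_lextent_t>& extent_map,
                 const std::map<uint64_t, bluestore_blob_t>& blob_map,
                 uint64_t logical_off) {
  // find the lextent starting at or before the logical offset
  auto le = --extent_map.upper_bound(logical_off);
  const bluestore_lextent_t& l = le->second;
  // translate to an offset within the referenced blob
  uint64_t b_off = l.b_off + (logical_off - le->first);
  // walk the blob's physical extents to find the on-disk address
  for (const bluestore_pextent_t& pe : blob_map.at(l.blob_id).extents) {
    if (b_off < pe.length)
      return pe.offset + b_off;
    b_off -= pe.length;
  }
  return 0;  // unreachable if the maps are consistent
}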

Page 36: Bluestore

PERFORMANCE

Page 37: Bluestore

HDD: SEQUENTIAL WRITE

[Chart: Ceph 10.1.0 BlueStore vs FileStore sequential writes on HDD; throughput (MB/s) vs IO size (4K–4096K); series: FS HDD, BS HDD]

Page 38: Bluestore

HDD: RANDOM WRITE

[Charts: Ceph 10.1.0 BlueStore vs FileStore random writes on HDD; throughput (MB/s) and IOPS vs IO size (4K–4096K); series: FS HDD, BS HDD]

Page 39: Bluestore

HDD: SEQUENTIAL READ

[Chart: Ceph 10.1.0 BlueStore vs FileStore sequential reads on HDD; throughput (MB/s) vs IO size (4K–4096K); series: FS HDD, BS HDD]

Page 40: Bluestore

HDD: RANDOM READ

[Charts: Ceph 10.1.0 BlueStore vs FileStore random reads on HDD; throughput (MB/s) and IOPS vs IO size (4K–4096K); series: FS HDD, BS HDD]

Page 41: Bluestore

SSD AND NVME?

NVMe journal
  random writes ~2x faster
  some testing anomalies (problem with test rig kernel?)

SSD only
  similar to HDD result
  small write benefit is more pronounced

NVMe only
  more testing anomalies on test rig... WIP

Page 42: Bluestore

AVAILABILITY

Experimental backend in Jewel v10.2.z (just released)
  enable experimental unrecoverable data corrupting features = bluestore rocksdb
  ceph-disk --bluestore DEV
  no multi-device magic provisioning just yet

The goal...
  stable in Kraken (Fall '16)?
  default in Luminous (Spring '17)?

Page 43: Bluestore

SUMMARY

Ceph is great
POSIX was a poor choice for storing objects
Our new BlueStore backend is awesome
RocksDB rocks and was easy to embed
Log recycling speeds up commits (now upstream)
Delayed merge will help too (coming soon)

Page 44: Bluestore

THANK YOU!

Sage Weil

CEPH PRINCIPAL ARCHITECT

[email protected]

@liewegas

As told by Allen Samuels

Engineering Fellow, SanDisk

[email protected]

Ceph Day, Portland 2016.5.25

Page 45: Bluestore

Status Update (2016.5.25)

Substantial re-implementation for enhanced features and performance
New extent structure to easily support compression and checksums (PR published yesterday)
New disk format, NOT backward compatible with Tech Preview
Overlay / WAL mechanism rebuilt to separate policy from mechanism
Bitmap allocator to reduce memory usage and unnecessary backend serialization on flash
Extends KeyValueDB to utilize Merge concept
BlueFS log redesign to avoid stall on compaction
User-space block cache
GSoC student working on SMR support
SanDisk ZetaScale open sourced and being prepped as RocksDB replacement on flash
Goal of being stable in K release is still in sight!

Page 46: Bluestore

RocksDB Tuning

• RocksDB tuned for better performance
  • opt.IncreaseParallelism(16);
  • opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
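
Both calls are standard rocksdb::Options helpers; below is a minimal sketch of applying them when opening a database. The path and the remaining options are illustrative, not the tuning actually used in these tests.

#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

rocksdb::DB* open_tuned_db(const std::string& path) {
  rocksdb::Options opt;
  opt.create_if_missing = true;
  opt.IncreaseParallelism(16);                          // more background threads
  opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);  // larger memtable/compaction budget
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opt, path, &db);
  return s.ok() ? db : nullptr;
}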

[Chart: 4K 100% write TPS over runtime, RocksDB tuned options vs default options]

Page 47: Bluestore

BlueStore performance: RocksDB vs ZetaScale

[Chart: RBD 4K workload TPS (in thousands) by read/write ratio; series: BlueStore with RocksDB, BlueStore with ZetaScale]

Observation

• BlueStore with ZetaScale provides a 2.2x improvement in read performance compared to RocksDB

• BlueStore with ZetaScale provides a 20% improvement in mixed read/write compared to RocksDB

• BlueStore with ZetaScale 100% write performance is 10% less than RocksDB

Page 48: Bluestore

BlueStore performance: RocksDB vs ZetaScale

[Chart: 4K 100% write TPS over runtime; series: RocksDB, ZetaScale]

Observation

• ZetaScale performance is stable

• RocksDB performance is choppy due to compaction

• RocksDB write performance keeps degrading as data size grows because the compaction overhead grows with the data size

Page 49: Bluestore

BlueStore performance: RocksDB vs ZetaScale

[Chart: 70% read / 30% write TPS over runtime; series: RocksDB read, ZetaScale read, RocksDB write, ZetaScale write]

Observation

• ZetaScale read performance is stable

• RocksDB read performance is choppy due to compaction

Page 50: Bluestore

BlueStore performance: RocksDB vs ZetaScale

Read/Write  Engine     TPS    IO     CPU  IOs/s     Read     Write    IOs      CPU ms   Disk BW (rd+wr)
Ratio                  (Ks)   util%  %    (rd+wr)   KB/s     KB/s     per TPS  per TPS  KB per TPS
0/100       RocksDB     1.59   98    171   19094    106926   38290    12.0     1.1      91.3
0/100       ZetaScale   1.42   98    106   11970     28821   48877     8.4     0.7      54.7
70/30       RocksDB     3.59  100    233   24865    143867   33430     6.9     0.6      49.4
70/30       ZetaScale   4.28   98    178   16870     62331   44989     3.9     0.4      25.1
100/0       RocksDB    11.25  100    407   36370    215546    7300     3.2     0.4      19.8
100/0       ZetaScale  24.27  100    578   50109    291947   11537     2.1     0.2      12.5

Observation

• For the observed performance advantage with ZetaScale, the following are highlights of system resource utilization:
  IO:      BlueStore with ZetaScale uses up to 43% fewer IOs than RocksDB
  CPU:     BlueStore with ZetaScale uses 35% less CPU than RocksDB
  Disk BW: BlueStore with ZetaScale uses up to 50% less bandwidth than RocksDB

Page 51: Bluestore

Tasks in progress

• Multiple block size performance
  • 16K, 64K, 256K
• Large scale testing
  • Multiple OSDs with 128TB capacity
• Runs with worst-case metadata size
  • Onode size grows as the test progresses due to random writes
  • RocksDB performance keeps degrading as the dataset size increases
  • Measure performance under this condition and compare
• Tune ZetaScale for further improvement