Transcript of Bluestore
BLUESTORE: A NEW, FASTER STORAGE BACKEND FOR CEPH
SAGE WEIL – VAULT 2016.04.21
As told by
Allen Samuels
Engineering Fellow, SanDisk
Ceph Day, Portland 2016.5.25
2
OUTLINE
Ceph background and context
FileStore, and why POSIX failed us
NewStore – a hybrid approach
BlueStore – a new Ceph OSD backend
  Metadata
  Data
Performance
Upcoming changes
Summary
Update since Vault (4.21.2016)
MOTIVATION
4
OBJECTSTORE AND DATA MODEL
ObjectStore
  abstract interface for storing local data
  implementations: EBOFS, FileStore
EBOFS
  a user-space extent-based object file system
  deprecated in favor of FileStore on btrfs in 2009
Object
  data (file-like byte stream)
  attributes (small key/value)
  omap (unbounded key/value)
Collection
  placement group shard (slice of the RADOS pool)
  sharded by 32-bit hash value
All writes are transactions
  Atomic + Consistent + Durable
  Isolation provided by OSD
5
FILESTORE
FileStore
  PG = collection = directory
  object = file
Leveldb
  large xattr spillover
  object omap (key/value) data
Originally just for development... later, only supported backend (on XFS)

/var/lib/ceph/osd/ceph-123/
  current/
    meta/
      osdmap123
      osdmap124
    0.1_head/
      object1
      object12
    0.7_head/
      object3
      object5
    0.a_head/
      object4
      object6
  db/
    <leveldb files>
6
POSIX FAILS: TRANSACTIONS
OSD carefully manages consistency of its data
All writes are transactions
  we need A+C+D; OSD provides I
Most are simple
  write some bytes to object (file)
  update object attribute (file xattr)
  append to update log (leveldb insert)
...but others are arbitrarily large/complex

[
  { "op_name": "write",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "length": 4194304, "offset": 0, "bufferlist length": 4194304 },
  { "op_name": "setattrs",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "attr_lens": { "_": 269, "snapset": 31 } },
  { "op_name": "omap_setkeys",
    "collection": "0.6_head",
    "oid": "#0:60000000::::head#",
    "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 } }
]
7
POSIX FAILS: TRANSACTIONS
Btrfs transaction hooks

/* trans start and trans end are dangerous, and only for
 * use by applications that know how to avoid the
 * resulting deadlocks */
#define BTRFS_IOC_TRANS_START  _IO(BTRFS_IOCTL_MAGIC, 6)
#define BTRFS_IOC_TRANS_END    _IO(BTRFS_IOCTL_MAGIC, 7)

Writeback ordering
#define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)

What if we hit an error? ceph-osd process dies?
#define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)
There is no rollback...
8
POSIX FAILS: TRANSACTIONS
Write-ahead journal
  serialize and journal every ObjectStore::Transaction
  then write it to the file system
Btrfs parallel journaling
  periodic sync takes a snapshot
  on restart, roll back, and replay journal against appropriate snapshot
XFS/ext4 write-ahead journaling
  periodic sync, then trim old journal entries
  on restart, replay entire journal
  lots of ugly hackery to deal with events that aren't idempotent
    e.g., renames, collection delete + create, …
full data journal → we double write everything → ~halve disk throughput
9
POSIX FAILS: ENUMERATION
Ceph objects are distributed by a 32-bit hash
Enumeration is in hash order
  scrubbing
  “backfill” (data rebalancing, recovery)
  enumeration via librados client API
POSIX readdir is not well-ordered
Need O(1) “split” for a given shard/range
Build directory tree by hash-value prefix
  split any directory when size > ~100 files
  merge when size < ~50 files
  read entire directory, sort in-memory

…
DIR_A/
DIR_A/A03224D3_qwer
DIR_A/A247233E_zxcv
…
DIR_B/
DIR_B/DIR_8/
DIR_B/DIR_8/B823032D_foo
DIR_B/DIR_8/B8474342_bar
DIR_B/DIR_9/
DIR_B/DIR_9/B924273B_baz
DIR_B/DIR_A/
DIR_B/DIR_A/BA4328D2_asdf
…
NEWSTORE
11
NEWSTORE GOALS
More natural transaction atomicity
Avoid double writes
Efficient object enumeration
Efficient clone operation
Efficient splice (“move these bytes from object X to object Y”)
Efficient IO pattern for HDDs, SSDs, NVMe
Minimal locking, maximum parallelism (between PGs)
Advanced features
  full data and metadata checksums
  compression
12
NEWSTORE – WE MANAGE NAMESPACE
POSIX has the wrong metadata model for us
Ordered key/value is a perfect match
  well-defined object name sort order
  efficient enumeration and random lookup
NewStore = rocksdb + object files

/var/lib/ceph/osd/ceph-123/
  db/
    <rocksdb, leveldb, whatever>
  blobs.1/
    0
    1
    ...
  blobs.2/
    100000
    100001
    ...
[diagram: OSDs running NewStore; object files on HDD or SSD, RocksDB optionally on a separate SSD]
13
NEWSTORE FAIL: CONSISTENCY OVERHEAD
RocksDB has a write-ahead log “journal”
XFS/ext4(/btrfs) have their own journal (tree-log)
Journal-on-journal has high overhead
  each journal manages half of overall consistency, but incurs the same overhead
write(2) + fsync(2) to new blobs.2/103021
  1 write + flush to block device
  1 write + flush to XFS/ext4 journal
write(2) + fsync(2) on RocksDB log
  1 write + flush to block device
  1 write + flush to XFS/ext4 journal
14
NEWSTORE FAIL: ATOMICITY NEEDS WAL
We can't overwrite a POSIX file as part of an atomic transaction
Writing overwrite data to a new file means many files for each object
Write-ahead logging?
  put overwrite data in “WAL” records in RocksDB
  commit atomically with transaction
  then overwrite original file data
  but we're back to a double-write of overwrites...
Performance sucks again
Overwrites dominate RBD block workloads
BLUESTORE
16
BLUESTORE
BlueStore = Block + NewStore
  consume raw block device(s)
  key/value database (RocksDB) for metadata
  data written directly to block device
  pluggable block Allocator
We must share the block device with RocksDB
  implement our own rocksdb::Env
  implement tiny “file system” BlueFS
  make BlueStore and BlueFS share the device
[diagram: ObjectStore → BlueStore; data goes directly to a BlockDevice; metadata goes to RocksDB via BlueRocksEnv on BlueFS, which has its own BlockDevice; an Allocator manages space]
17
ROCKSDB: BLUEROCKSENV + BLUEFS
class BlueRocksEnv : public rocksdb::EnvWrapper
  passes file IO operations to BlueFS
BlueFS is a super-simple “file system”
  all metadata loaded in RAM on start/mount
  no need to store block free list
  coarse allocation unit (1 MB blocks)
  all metadata lives in a journal
  journal rewritten/compacted when it gets large

[layout: superblock, journal, more journal, data extents]

journal records: file 10, file 11, file 12, file 12, file 13, rm file 12, file 13, ...
Map “directories” to different block devices
  db.wal/ – on NVRAM, NVMe, SSD
  db/ – level0 and hot SSTs on SSD
  db.slow/ – cold SSTs on HDD
BlueStore periodically balances free space
18
ROCKSDB: JOURNAL RECYCLING
Problem: 1 small (4 KB) Ceph write → 3-4 disk IOs!
  BlueStore: write 4 KB of user data
  rocksdb: append record to WAL
    write updates block at end of log file
    fsync: XFS/ext4/BlueFS journals inode size/alloc update to its journal
fallocate(2) doesn't help
  data blocks are not pre-zeroed; fsync still has to update alloc metadata
rocksdb LogReader only understands two modes
  read until end of file (need accurate file size)
  read all valid records, then ignore zeros at end (need zeroed tail)
19
ROCKSDB: JOURNAL RECYCLING (2)
Put old log files on recycle list (instead of deleting them)
LogWriter
  overwrite old log data with new log data
  include log number in each record
LogReader
  stop replaying when we get garbage (bad CRC)
  or when we get a valid CRC but the record is from a previous log incarnation
Now we get one log append → one IO!
Upstream in RocksDB!
  but missing a bug fix (PR #881)
Works with normal file-based storage, or BlueFS
METADATA
21
BLUESTORE METADATA
Partition namespace for different metadata
  S* – “superblock” metadata for the entire store
  B* – block allocation metadata
  C* – collection name → cnode_t
  O* – object name → onode_t or enode_t
  L* – write-ahead log entries, promises of future IO
  M* – omap (user key/value data, stored in objects)
  V* – overlay object data (obsolete?)
22
ONODE
Per-object metadata
  Lives directly in key/value pair
  Serializes to ~200 bytes
  Unique object id (like ino_t)
  Size in bytes
  Inline attributes (user attr data)
  Block pointers (user byte data)
  Overlay metadata
  Omap prefix/ID (user k/v data)

struct bluestore_onode_t {
  uint64_t nid;
  uint64_t size;
  map<string,bufferptr> attrs;
  map<uint64_t,bluestore_extent_t> block_map;
  map<uint64_t,bluestore_overlay_t> overlay_map;
  uint64_t omap_head;
};

struct bluestore_extent_t {
  uint64_t offset;
  uint32_t length;
  uint32_t flags;
};
23
CNODE
Collection metadata
  Interval of object namespace

   shard  pool  hash       name       bits
C<NOSHARD, 12, 3d321e00>  “12.e123d3” = <25>

   shard  pool  hash      name  snap    gen
O<NOSHARD, 12, 3d321d88, foo, NOSNAP, NOGEN> = …
O<NOSHARD, 12, 3d321d92, bar, NOSNAP, NOGEN> = …
O<NOSHARD, 12, 3d321e02, baz, NOSNAP, NOGEN> = …
O<NOSHARD, 12, 3d321e12, zip, NOSNAP, NOGEN> = …
O<NOSHARD, 12, 3d321e12, dee, NOSNAP, NOGEN> = …
O<NOSHARD, 12, 3d321e38, dah, NOSNAP, NOGEN> = …

struct spg_t {
  uint64_t pool;
  uint32_t hash;
  shard_id_t shard;
};

struct bluestore_cnode_t {
  uint32_t bits;
};

Nice properties
  Ordered enumeration of objects
  We can “split” collections by adjusting metadata
24
ENODE
Extent metadata
  Sometimes we share blocks between objects (usually clones/snaps)
  We need to reference count those extents
  We still want to split collections and repartition extent metadata by hash

struct bluestore_enode_t {
  struct record_t {
    uint32_t length;
    uint32_t refs;
  };
  map<uint64_t,record_t> ref_map;

  void add(uint64_t offset, uint32_t len, unsigned ref=2);
  void get(uint64_t offset, uint32_t len);
  void put(uint64_t offset, uint32_t len,
           vector<bluestore_extent_t> *release);
};
DATA PATH
26
DATA PATH BASICS
Terms
  Sequencer
    An independent, totally ordered queue of transactions
    One per PG
  TransContext
    State describing an executing transaction
Two ways to write
  New allocation
    Any write larger than min_alloc_size goes to a new, unused extent on disk
    Once that IO completes, we commit the transaction
  WAL (write-ahead-logged)
    Commit temporary promise to (over)write data with transaction
      includes data!
    Do async overwrite
    Clean up temporary k/v pair
27
TRANSCONTEXT STATE MACHINE
PREPARE → AIO_WAIT → KV_QUEUED → KV_COMMITTING → FINISH
WAL path: KV_COMMITTING → WAL_QUEUED → WAL_AIO_WAIT → WAL_CLEANUP → FINISH

[diagram: a Sequencer queue of TransContexts advancing through these states; annotations: “Initiate some AIO”, “Wait for next TransContext(s) in Sequencer to be ready”, “Wait for next commit batch”]
28
AIO, O_DIRECT, AND CACHING
From open(2) man page:
  “Applications should avoid mixing O_DIRECT and normal I/O to the same file,
  and especially to overlapping byte regions in the same file.”
By default, all IO is AIO + O_DIRECT
  but sometimes we want to cache (e.g., POSIX_FADV_WILLNEED)
BlueStore mixes direct and buffered IO
  O_DIRECT read(2) and write(2) invalidate and/or flush pages... racily
  we avoid mixing them on the same pages
  disable readahead: posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM)
But it's not quite enough...
  moving to a fully user-space cache
29
BLOCK FREE LIST
FreelistManager
  persist list of free extents to key/value store
  prepare incremental updates for allocate or release
Initial implementation
  <offset> = <length>
  keep in-memory copy
  enforce an ordering on commits
    del 1600=100000
    put 1700=99990
  small initial memory footprint, very expensive when fragmented
New approach
  <offset> = <region bitmap> where region is N blocks (1024?)
  no in-memory state
  use k/v merge operator to XOR allocation or release
    merge 10=0000000011
    merge 20=1110000000
  RocksDB log-structured merge tree coalesces keys during compaction
30
BLOCK ALLOCATOR
Allocator
  abstract interface to allocate new space
StupidAllocator
  bin free extents by size (powers of 2)
  choose sufficiently large extent closest to hint
  highly variable memory usage
    btree of free extents
  implemented, works
  based on ancient ebofs policy
BitmapAllocator
  hierarchy of indexes
    L1: 2 bits = 2^6 blocks
    L2: 2 bits = 2^12 blocks
    ...
    00 = all free, 11 = all used, 01 = mix
  fixed memory consumption
    ~35 MB RAM per TB
31
SMR HDD
Let's support them natively!
256 MB zones / bands
  must be written sequentially, but not all at once
libzbc supports ZAC and ZBC HDDs
  host-managed or host-aware
SMRAllocator
  write pointer per zone
  used + free counters per zone
  Bonus: almost no memory!
IO ordering
  must ensure allocated writes reach disk in order
Cleaning
  store k/v hints: zone → object hash
  pick emptiest closed zone, scan hints, move objects that are still there
  opportunistically rewrite objects we read if the zone is flagged for cleaning soon
FANCY STUFF
33
WE WANT FANCY STUFF
Full data checksums
  We scrub... periodically
  We want to validate checksum on every read
Compression
  3x replication is expensive
  Any scale-out cluster is expensive
34
WE WANT FANCY STUFF
Full data checksums
  We scrub... periodically
  We want to validate checksum on every read
  More data with each extent pointer
    32-bit csum per 4KB block
    bigger onode: 300 → 4396 bytes
    larger csum blocks?
Compression
  3x replication is expensive
  Any scale-out cluster is expensive
  Need largish extents to get compression benefit (64 KB, 128 KB)
    overwrites need to do read/modify/write
35
INDIRECT EXTENT STRUCTURES

// in the onode:
map<uint64_t,bluestore_lextent_t> extent_map;

struct bluestore_lextent_t {
  uint64_t blob_id;
  uint64_t b_off, b_len;
};

// in the onode or enode:
map<uint64_t,bluestore_blob_t> blob_map;

struct bluestore_blob_t {
  vector<bluestore_pextent_t> extents;
  uint32_t logical_length;   ///< uncompressed length
  uint32_t flags;            ///< FLAGS_*
  uint16_t num_refs;
  uint8_t csum_type;         ///< CSUM_*
  uint8_t csum_block_order;
  vector<char> csum_data;
};

// data on block device:
struct bluestore_pextent_t {
  uint64_t offset, length;
};

An onode may reference a piece of a blob, or multiple pieces
csum block size can vary
blobs are ref counted (when in enode)
PERFORMANCE
37
HDD: SEQUENTIAL WRITE
[chart: Ceph 10.1.0 BlueStore vs FileStore sequential writes; throughput (MB/s) vs IO size, 4K to 4096K; series: FS HDD, BS HDD]
38
HDD: RANDOM WRITE
[chart: Ceph 10.1.0 BlueStore vs FileStore random writes; throughput (MB/s) vs IO size, 4K to 4096K; series: FS HDD, BS HDD]
[chart: same workload; IOPS vs IO size]
39
HDD: SEQUENTIAL READ
[chart: Ceph 10.1.0 BlueStore vs FileStore sequential reads; throughput (MB/s) vs IO size, 4K to 4096K; series: FS HDD, BS HDD]
40
HDD: RANDOM READ
[chart: Ceph 10.1.0 BlueStore vs FileStore random reads; throughput (MB/s) vs IO size, 4K to 4096K; series: FS HDD, BS HDD]
[chart: same workload; IOPS vs IO size]
41
SSD AND NVME?
NVMe journal
  random writes ~2x faster
  some testing anomalies (problem with test rig kernel?)
SSD only
  similar to HDD results
  small-write benefit is more pronounced
NVMe only
  more testing anomalies on test rig.. WIP
42
AVAILABILITY
Experimental backend in Jewel v10.2.z (just released)
  enable experimental unrecoverable data corrupting features = bluestore rocksdb
  ceph-disk --bluestore DEV
  no multi-device magic provisioning just yet
The goal...
  stable in Kraken (Fall '16)?
  default in Luminous (Spring '17)?
43
SUMMARY
Ceph is great
POSIX was a poor choice for storing objects
Our new BlueStore backend is awesome
RocksDB rocks and was easy to embed
Log recycling speeds up commits (now upstream)
Delayed merge will help too (coming soon)
THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
@liewegas

As told by
Allen Samuels
Engineering Fellow, SanDisk
Ceph Day, Portland 2016.5.25
45
Status Update (2016.5.25)
Substantial re-implementation for enhanced features and performance
New extent structure to easily support compression and checksums (PR published yesterday)
New disk format, NOT backward compatible with Tech Preview
Overlay / WAL mechanism rebuilt to separate policy from mechanism
Bitmap allocator to reduce memory usage and unnecessary backend serialization on flash
Extends KeyValueDB to utilize the Merge concept
BlueFS log redesign to avoid stall on compaction
User-space block cache
GSoC student working on SMR support
SanDisk ZetaScale open sourced and being prepped as a RocksDB replacement on flash
Goal of being stable in the K release is still in sight!
46
RocksDB Tuning
• RocksDB tuned for better performance
• opt.IncreaseParallelism(16);
• opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
[chart: 4K 100% write; TPS over runtime, RocksDB tuned options vs default options]
47
BlueStore performance RocksDB Vs ZetaScale
[chart: RBD 4K workload TPS (in thousands) at read/write ratios 0/100, 70/30, 100/0; BlueStore with RocksDB vs BlueStore with ZetaScale]
Observation
• BlueStore with ZetaScale provides a 2.2x improvement in read performance compared to RocksDB
• BlueStore with ZetaScale provides a 20% improvement in mixed read/write compared to RocksDB
• BlueStore with ZetaScale 100% write performance is 10% less than RocksDB
48
BlueStore performance RocksDB Vs ZetaScale
[chart: 4K 100% write; TPS over runtime, RocksDB vs ZetaScale]
Observation
• ZetaScale performance is stable
• RocksDB performance is choppy due to compaction
• RocksDB write performance keeps degrading as data size grows because the compaction overhead grows with the data size
49
[chart: 70% read / 30% write; TPS over runtime; series: RocksDB read, ZetaScale read, RocksDB write, ZetaScale write]
BlueStore performance RocksDB Vs ZetaScale
Observation
• ZetaScale read performance is stable
• RocksDB read performance is choppy due to compaction
50
Ratio  | KV store  | TPS (Ks) | IO util% | CPU% | IOs (rd+wr)/sec | Read KB/sec | Write KB/sec | IOs per TPS | CPU ms per TPS | Disk BW (rd+wr) KB per TPS
0/100  | RocksDB   | 1.59     | 98       | 171  | 19094           | 106926      | 38290        | 12.0        | 1.1            | 91.3
0/100  | ZetaScale | 1.42     | 98       | 106  | 11970           | 28821       | 48877        | 8.4         | 0.7            | 54.7
70/30  | RocksDB   | 3.59     | 100      | 233  | 24865           | 143867      | 33430        | 6.9         | 0.6            | 49.4
70/30  | ZetaScale | 4.28     | 98       | 178  | 16870           | 62331       | 44989        | 3.9         | 0.4            | 25.1
100/0  | RocksDB   | 11.25    | 100      | 407  | 36370           | 215546      | 7300         | 3.2         | 0.4            | 19.8
100/0  | ZetaScale | 24.27    | 100      | 578  | 50109           | 291947      | 11537        | 2.1         | 0.2            | 12.5
BlueStore performance RocksDB Vs ZetaScale
Observation
• For the observed performance advantage with ZetaScale, the highlights of system resource utilization are:
  IO: BlueStore with ZetaScale uses up to 43% fewer IOs compared to RocksDB
  CPU: BlueStore with ZetaScale uses 35% less CPU compared to RocksDB
  Disk BW: BlueStore with ZetaScale uses up to 50% less bandwidth compared to RocksDB
51
Tasks in progress
• Multiple block sizes performance
  • 16K, 64K, 256K
• Large-scale testing
  • Multiple OSDs with 128TB capacity
  • Runs with worst-case metadata size
    • Onode size grows as the test progresses due to random writes
    • RocksDB performance keeps degrading as the dataset size increases
    • Measure performance under this condition and compare
• Tune ZetaScale for further improvement