RocksDB storage engine for MySQL and MongoDB

33
RocksDB Storage Engine Igor Canadi | Facebook

Transcript of RocksDB storage engine for MySQL and MongoDB

Page 1: RocksDB storage engine for MySQL and MongoDB

RocksDB Storage Engine Igor Canadi | Facebook

Page 2: RocksDB storage engine for MySQL and MongoDB

Overview

•  Story of RocksDB

•  Architecture

•  Performance tuning

•  Next steps

1

Page 3: RocksDB storage engine for MySQL and MongoDB

Story of RocksDB

Page 4: RocksDB storage engine for MySQL and MongoDB

Pre-2011

•  FB infrastructure – many custom-built key-value stores

•  LevelDB released

2

Page 5: RocksDB storage engine for MySQL and MongoDB

Experimentation (2011 – 2013)

•  First use-cases

•  Not designed for server – many bottlenecks, stalls

•  Optimization

•  New features

3

Page 6: RocksDB storage engine for MySQL and MongoDB

Explosion (2013 – 2015)

•  Open sourced RocksDB

•  Big success within Facebook

•  External traction – Linkedin, Yahoo, CockroachDB, …

4

Page 7: RocksDB storage engine for MySQL and MongoDB

New Challenges (2015 - )

•  Bring RocksDB to databases

5

Page 8: RocksDB storage engine for MySQL and MongoDB

MongoRocks

•  Running in production at Parse for 6 months

•  Huge storage savings (5TB à 285GB)

•  Document-level locking

6

Page 9: RocksDB storage engine for MySQL and MongoDB

MyRocks

7 InnoDB RocksDB

0

0.2

0.4

0.6

0.8

1

1.2

Database size (relative)

InnoDB

RocksDB

InnoDB RocksDB 0

0.2

0.4

0.6

0.8

1

1.2

Bytes written (relative)

InnoDB

RocksDB

Page 10: RocksDB storage engine for MySQL and MongoDB

Architecture Log Structured Merge Trees

Page 11: RocksDB storage engine for MySQL and MongoDB

Log Structured Merge Trees

8

(64MB)

(256MB)

(512MB)

(5GB)

(50GB)

(500GB)

Memtable

Level 0

Level 1

Level 2

Level 3

Level 4

Page 12: RocksDB storage engine for MySQL and MongoDB

Log Structured Merge Trees – write

9

(64MB)

(256MB)

Memtable

Level 0

(key,value)

Page 13: RocksDB storage engine for MySQL and MongoDB

Log Structured Merge Trees – flush

10

(64MB)

(256MB)

Memtable

Level 0

Page 14: RocksDB storage engine for MySQL and MongoDB

Log Structured Merge Trees – compaction

11

(5GB)

(50GB)

Level 2

Level 3

Page 15: RocksDB storage engine for MySQL and MongoDB

Writes

•  Foreground:

•  Writes go to memtable (skiplist) + write-ahead log

•  Background:

•  When memtable is full, we flush to Level 0

•  When a level is full, we run compaction

12

Page 16: RocksDB storage engine for MySQL and MongoDB

Reads

13

(64MB)

(256MB)

(512MB)

(5GB)

(50GB)

(500GB)

Memtable

Level 0

Level 1

Level 2

Level 3

Level 4

Page 17: RocksDB storage engine for MySQL and MongoDB

Reads

•  Point queries

•  Bloom filters reduce reads from storage

•  Usually only 1 read IO

•  Range scans

•  Bloom filters don’t help

•  Depends on amount of memory, 1-2 IO

14

Page 18: RocksDB storage engine for MySQL and MongoDB

RocksDB Files

15

rocksdb/> ls MANIFEST-000032 000024.log 000031.log 000025.sst 000028.sst 000029.sst 000033.sst 000034.sst LOG LOG.old.1441234029851978 ...

Page 19: RocksDB storage engine for MySQL and MongoDB

RocksDB Files – MANIFEST

16

(initial state) Add file 1 Add file 2 Add file 3 Add file 4 …

(flush) Add file 9 Mark log 6 persisted

(compaction) Add file 10 Add file 11 Remove file 9 Remove file 8

Add new column family “system”

•  Atomical updates to database metadata

Page 20: RocksDB storage engine for MySQL and MongoDB

RocksDB Files – Write-ahead log

17

Write (A, B) Write (C, D) Write (E, F)

Delete(A) Write(X, Y) Delete(C)

•  Persisted memtable state

Page 21: RocksDB storage engine for MySQL and MongoDB

RocksDB Files – Table files

18

(Data block) •  compressed •  prefix encoded

(Data block) <key, value>

(Data block) (Data block)

(Data block)

(Data block)

(Data block)

(Data block)

(Index block) <key, block>

(Filter block) (Statistics) (Meta index block) Pointers to blocks

Page 22: RocksDB storage engine for MySQL and MongoDB

RocksDB Files – LOG files

•  Debugging output

•  Tuning options

•  Information about flushes and compactions

•  Performance statistics

19

Page 23: RocksDB storage engine for MySQL and MongoDB

Backups

•  Table files are immutable

•  Other files are append-only

•  Easy and fast incremental backups

•  Open sourced Rocks-Strata

20

Page 24: RocksDB storage engine for MySQL and MongoDB

Performance tuning

Page 25: RocksDB storage engine for MySQL and MongoDB

Tombstones

•  Deletions are deferred

•  May cause higher P99 latencies

•  Be careful with pathological workloads, e.g. queues

21

Page 26: RocksDB storage engine for MySQL and MongoDB

Caching

22

Block cache •  Managed by RocksDB •  Uncompressed data •  Defaults to 1/3 of RAM

Page cache •  Managed by kernel •  Compressed data

Page 27: RocksDB storage engine for MySQL and MongoDB

Memory usage

•  Block cache

•  Index and filter blocks (0.5 – 2% of the database)

•  Memtables

•  Blocks pinned by iterators

23

Page 28: RocksDB storage engine for MySQL and MongoDB

Reduce memory usage

•  Reduce block cache size – will increase CPU

•  Increase block size – decrease index size

•  Turn off bloom filters on bottom level

24

Page 29: RocksDB storage engine for MySQL and MongoDB

Reduce CPU

•  Profile the CPU usage

•  Increase block cache size – will increase memory usage

•  Turn off compression

•  It might be tombstones

25

Page 30: RocksDB storage engine for MySQL and MongoDB

Reduce write amplification

•  Write amplification = 5 * num_levels

•  Increase memtable and level 1 size

•  Stronger (zlib, zstd) compression for bottom levels

•  Try universal compaction

26

Page 31: RocksDB storage engine for MySQL and MongoDB

Next steps

Page 32: RocksDB storage engine for MySQL and MongoDB

Next steps

•  Increase performance & stability

•  Deploy MyRocks at Facebook

•  External adoption of MyRocks and MongoRocks

•  Build an ecosystem

27

Page 33: RocksDB storage engine for MySQL and MongoDB

Thank you