BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast...

47
BetrFS: A Right-Optimized Write-Optimized File System Amogh Akshintala, Michael Bender, Kanchan Chandnani, Pooja Deo, Martin Farach-Colton, William Jannen, Rob Johnson, Zardosht Kasheff, Bradley C. Kuszmaul, Prashant Pandey, Donald E. Porter, Leif Walsh, Jun Yuan, Yang Zhan Facebook, Farmingdale College, MIT & Oracle, Rutgers, Stony Brook, Two Sigma, UNC, Williams College

Transcript of BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast...

Page 1: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS: A Right-Optimized Write-Optimized File System

Amogh Akshintala, Michael Bender, Kanchan Chandnani, Pooja Deo, Martin Farach-Colton, William Jannen, Rob Johnson, Zardosht Kasheff, Bradley C. Kuszmaul, Prashant Pandey, Donald E. Porter, Leif Walsh, Jun Yuan, Yang

Zhan

Facebook, Farmingdale College, MIT & Oracle, Rutgers, Stony Brook, Two Sigma, UNC, Williams College

Page 2: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

• Sequential reads• Sequential writes• Random writes• File/directory renames• File deletes• Recursive scans• Metadata updates

General-purpose file-systems strive to perform well on a wide variety of applications

Page 3: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

• Sequential reads• Sequential writes• Random writes• File/directory renames• File deletes• Recursive scans• Metadata updates

Achieving good performance on all these operations is a long-standing challenge

Example: ext4

Page 4: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

• Sequential reads• Sequential writes• Random writes• File/directory renames• File deletes• Recursive scans• Metadata updates

Achieving good performance on all these operations is a long-standing challenge

Logging updates is fast, but logged data can have little

locality

Example: log-based file systems

Page 5: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Some operations seem to require a trade-off

sequential reads random writes

directory scans renames

update-in-place log structured

inodes full-path indexing

Page 6: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS

Main idea of this talk

Page 7: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Write-Optimized Data Structures (WODS)

● New class of data structures discovered in 90's– LSM trees [O'Neil, Cheng, Gawlick, & O'Neil '96]

– Bε-trees [Brodal & Fagerberg '03]

– COLAs [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, & Nelson '07]

– xDicts [Brodal, Demaine, Fineman, Iacono, Langerman, & Munro '10]

● WODS perform inserts/updates/deletes orders-of-magnitude-faster than in a B-tree– WODS queries are asymptotically no slower than in a B-tree

BetrFS usesBε-trees

Page 8: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

How computation works: •Data is transferred in blocks between RAM and disk. •The number of block transfers dominates the running time.

Goal: Minimize # of block transfers•Performance bounds are parameterized by block size B,

memory size M, data size N.

The Disk-Access Machine (DAM) model [Aggarwal & Vitter '88]

RAM Disk

B

B

M

Page 9: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

…. children ….

…......................................................

O(logB N )

Example: B-trees

B

B

B B

B B

Insert cost: O ( logB N ) Lookup cost: O ( logB N )

Page 10: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

B…. children ….

O(log√B N )

Bε-trees (ε=1/2)

√B

√B

B−√B

√B

B−√B

√B

B−√B

…......................................................B B

buffer space

Insert cost: O( √BB−√B

log√B N )=O( logB N

√B ) Lookup cost: O(log√B N )=O(logB N )

Range query cost: O(logB N+k /B)

Inserts are orders-of-magnitude faster than in a B-tree

Range queries canrun at disk bandwidth

Point queries are asymptotically as fast

as in a B-tree

Page 11: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

11

The search-insert asymmetry

● Inserts are orders-of-magnitude faster than point queries

● But many updates require querying the old value first● e.g. “Add $10 to rob's account balance”

– OldBalance = query(rob)– NewBalance = oldBalance + $10– insert(newBalance)

Point query:

Insert: O( logB N

√B )

O (logB N )

Page 12: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

B…. children ….

O(log√B N )

Upserts: read-modify-write as fast as an insert

√B

√B

B−√B

√B

B−√B

√B

B−√B

…......................................................B B

buffer space

rob: $5

rob:add $10

rob: $15

Page 13: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Bε-tree performance summary

Point query

Insert/delete/upsert

Range query

O( logB N

√B )

O ( logB N )

O (logB N+k /B )

Very fast (10K-100K per second)

As fast as a B-tree

To get the best possible performance, we want to do

Inserts, deletes, upserts, and range queries, and avoid point queries.

Near disk bandwidth

Page 14: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

• Maintain two separate Bε-tree indexes:

metadata index: full path ­> struct statdata index: (full path, blk#) ­> data[4096]

• Implications: Fast directory scansData blocks are laid out sequentially

The BetrFS schema (version 0.1)

Page 15: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Mapping file-system operations to key-value operations

read

write

metadata update

readdir

mkdir/rmdir

unlink

rename

range query

insert/upsert

upsert

range query

insert/delete

*delete each block

*delete then reinsert each block

Operation Implementation

Fast atime updates

Fast directorytraversals

Do not map ontosingle key-value

operations

Page 16: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

0.1

1

10

100

*lower is better

Time (

s) BetrFSbtrfsext4xfszfs

1000 Random 4−byte writes

Small, random, unaligned writes are an order-of-magnitude faster

●1 GiB file, random data

●1,000 random 4-byte writes

●fsync() at endLog scale

0.17sec vs > 10sec

BetrFS random writes

benefit from Bε-treeinsertion performance

Page 17: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

100

1000

10000

100000

0 1M 2M 3MFiles Created

*higher is better

Files

/seco

nd

BetrFSbtrfsext4xfszfs

Small File Creation

Small file creates are an order-of-magnitude faster

●Create 3 million files and write

200-bytes to each

●Balanced directory tree with

fanout 128

Log scale

BetrFS file creates

benefit from Bε-treeinsertion performance

Page 18: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Sequential I/O

0

25

50

75

100

read writeOperation

*higher is better

MiB/s

BetrFSbtrfsext4xfszfs

1GiB Sequential I/O• Write random data to file, 10

4K-blocks at a time

• Sequentially read data back

BetrFS sequential reads benefit from Bε-tree range

query performance

Mostly overhead of full-data journaling

(we'll fix this later in the talk)

Page 19: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

0

20

40

60

80

Time

(s)

BetrFSbtrfsext4xfszfs

grep −r

0

5

10

15

20

Time (

s)

GNU Find

Recursive directory traversals

• Recursive scans from root of

Linux 3.11.10 source

• GNU find scans file metadata

• grep –r scans file contents

BetrFS directory traversals

benefit from Bε-treerange-query performance

Lower is better

About 3-8x fasterthan other file systems

Page 20: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

File deletion

0

100

200

300

File Size

Tim

e (s

)

BetrFS

BetrFS Delete Scaling

• Write random data to file,

fsync() it

• Delete file

BetrFS deletes require O(n) key-value

operations

Lower is better

Page 21: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Directory rename

Lower is better

BetrFS renames require O(n) key-value

operations

Renames areorders-of-magnitude

slower than in ext4

Page 22: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

• Sequential reads• Sequential writes• Random writes• File/directory renames• File deletes• Recursive scans• Metadata updates

BetrFS (version 0.1) performance summary

Let's fix these problems

Page 23: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Accelerating rename without slowing down directory traversals

Page 24: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Full-path indexing yields fast directory scans

Example: grep -r “key” /home/rob/doc/

Disk (physical)Directory Tree (logical)

/home/rob/doc/home/rob/doc/latex/home/rob/doc/latex/a.tex/home/rob/doc/latex/b.tex/home/rob/doc/bar.c/home/rob/local

….

….

….

home

roblocal

2.jpg

videodoc

late

x 1.mp4

a.te

x b.texbar.c

disk head

Page 25: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

home

roblocal

2.jpg

videodoc

late

x 1.mp4

a.te

x b.texbar.c

Rename is expensive when using full-path indexing

/home/rob/doc

/home/rob/doc/bar.c/home/rob/local

….

….

….

Example: mv /home/rob/doc/latex /home/rob

/home/rob/latex/a.tex/home/rob/latex/b.tex

/home/rob/latex

/home/rob/doc/latex/b.tex/home/rob/doc/latex/a.tex

Disk (physical)Directory Tree (logical)

/home/rob/doc/latex

late

x

Page 26: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

The tension between fast rename and fast scanZoning: balancing indirection and locality

Scan cost

Rename cost

Indirection LocalityZones

Page 27: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Implication: Recursive directory scans only perform seeks when

crossing zones

BetrFS v0.2 rethinking the schema

● Partition file system into zones

● Use full-path indexing within zones

● Use inodes between zones

Zone: a subtree of the directory hierarchy

home

roblocal

2.jpg

videodoc

1.mp4la

tex

a.te

x b.texbar.c

Zone 1Zone 2Zone 0

Page 28: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS v0.2 rethinking the schemaMoving the root of a zone is cheap

home

roblocal

2.jpg

videodoc

1.mp4la

tex

a.te

x b.texbar.c

Zone 1Zone 2

Example:mv /home/rob/video/1.mp4 /home/rob/doc

1.mp4

Zone 0

Page 29: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS v0.2 rethinking the schema

home

roblocal

2.jpg

videodoc

late

x

a.te

x b.texbar.c

Zone 1Zone 2

Example:mv /home/rob/doc/latex /home/rob/latex

1.mp4

Zone 0

Renaming a subtree of a zone requires copying

late

x

Page 30: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS v0.2 rethinking the schema

home

roblocal

2.jpg

videodoca.

tex b.tex

bar.c

Zone 1Zone 2

1.mp4

Zone 0

Managing zone sizes

late

x

Large zones → fast directory scansSmall zones → fast renames

We can keep zone sizes in a “sweet spot” by splitting large zones and merging small zones

Page 31: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

How big should zones be?

BetrFS-0.2 uses 512KB zones to balance rename and scan performance

Cost of renaming root of a zone

Cost of renaming via copy

Page 32: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS 0.2 rename performance

21 s

econ

ds

Rename linux source tree

Rename performance iscomparable to other

file systems

Page 33: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Performing sequential writes at disk bandwidth and with full data

journaling semantics

Page 34: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS version 0.1 writes everything twice

34

(k2, )

(k1, )

root

a b

root’

b’

Logtime

insert(k2, )

v2

v1

v2insert(k1, )v1

Page 35: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS version 0.2: late-binding journal

35

(k2, )

unbound(k1, )

root

a b

root’

b’

Logtime

insert(k2, )

v2

v1

v2Unbound-insert(k1) . . . bind( , )

Page 36: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Why don't we use late-binding for small writes?

● Reason 1:

– Late-binding requires writing out a large (e.g. 4MB) node

– For small writes, this is huge write-amplification

– It's more efficient to make small writes durable by simply logging them● Reason 2:

– Small, random writes get written to disk several times as they get flushed down the Bε-tree

– So writing them to the log is not a big extra cost

Page 37: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Late-binding journal:performance evaluation

Fast sequential writeswith full data journaling

Sequential write

Page 38: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Rangecast delete

Delete /foo/*

/bar//fo

o/a /foo/x

/goo/a

Garbagecollected

File deletions require inserting a single message

andenable efficient

garbage collection

Page 39: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Rangecast delete performance evaluation

Constant latency,about 30% faster than ext4

File delete

Page 40: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Is BetrFS still fast at other operations?

Page 41: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

BetrFS still performsrandom writes orders ofmagnitude faster than

other file systems

BetrFS still has metadataupdates almost 100x

faster than ext4

Zone splits

Page 42: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

What about real application performance?

Page 43: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

git clone git diff

Macrobenchmark: git

Performance comparable to other file systems Recursive scan performance

pays off in real applications

Page 44: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Macrobenchmark: dovecot imap maildir workload

Payoff of improved delete andrename performance

Page 45: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

• Sequential reads• Sequential writes• Random writes• File/directory renames• File deletes• Recursive scans• Metadata updates

BetrFS (version 0.2) performance summary

Page 46: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

Conclusion● Write-optimized data structures can enable us to overcome long-standing file-system design trade-offs

● Write-optimized file systems can offer across-the-board top-of-the-line performance

● Write-optimization creates a need/opportunity to revisit many file system design issues

● Code available atbetrfs.org

Page 47: BetrFS: A Right-Optimized Write-Optimized File System · Point queries are asymptotically as fast as in a B-tree. 11 The search-insert asymmetry ... 1000 10000 100000 0 1M 2M 3M Files

SSD performance preview

Still work to do

6x speedup