
Understanding Manycore Scalability of File Systems

Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim

Applications must parallelize I/O operations

● Death of single-core CPU scaling
  – CPU clock frequency: 3 ~ 3.8 GHz
  – # of physical cores: up to 24 (Xeon E7 v4)

● From mechanical HDD to flash SSD
  – IOPS of a commodity SSD: 900K
  – Non-volatile memory (e.g., 3D XPoint): 1,000x ↑

But file systems become a scalability bottleneck

3

Problem: Lack of understanding of internal scalability behavior

[Plot: Exim mail server on RAMDISK; messages/sec vs. #core (0-80) for btrfs, ext4, F2FS, and XFS. An embarrassingly parallel application, yet throughput either (1) saturates, (2) collapses, or (3) never scales.]

● Intel 80-core machine: 8-socket, 10-core Xeon E7-8870
● RAM: 512GB, 1TB SSD, 7200 RPM HDD

4

Even with slower storage media, the file system becomes a bottleneck

[Bar chart: Exim email server at 80 cores; messages/sec for btrfs, ext4, F2FS, and XFS on RAMDISK, SSD, and HDD.]

5

Outline

● Background
● FxMark design
  – A file system benchmark suite for manycore scalability
● Analysis of five Linux file systems
● Pilot solution
● Related work
● Summary

6

Research questions

● What file system operations are not scalable?

● Why are they not scalable?

● Is it a problem of implementation or design?

7

Technical challenges

● Applications are usually stuck at a few bottlenecks
  → the next level of bottlenecks cannot be seen before the current ones are resolved
  → difficult to understand the overall scalability behavior

● How can we systematically stress file systems to understand their scalability behavior?

8

FxMark: evaluate & analyze manycore scalability of file systems

FxMark: 19 micro-benchmarks, 3 applications
# cores: 1, 2, 4, 10, 20, 30, 40, 50, 60, 70, 80

File systems:
  – tmpfs (memory FS)
  – ext4 J/NJ, XFS (journaling FS)
  – btrfs (CoW FS)
  – F2FS (log FS)

Storage medium: SSD

9

FxMark: evaluate & analyze manycore scalability of file systems

In total, the matrix above yields >4,700 test configurations.

10

Microbenchmark: unveil hidden scalability bottlenecks

● Data block read

[Diagram: sharing level (Low / Medium / High) vs. what is shared (File, Block, Process); legend: R = read operation]

11
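To make the workload structure concrete, here is a minimal user-space sketch of a low-sharing data-block-read micro-benchmark in the spirit of the suite above. It is not FxMark's actual code; the file names, worker count, block size, and duration are illustrative assumptions.

/*
 * Minimal sketch (not FxMark's code) of a low-sharing data-block-read
 * micro-benchmark: each worker repeatedly reads one 4 KB block of its own
 * private file, so contention can only come from shared kernel state.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NWORKERS 4      /* e.g., one worker per core */
#define DURATION 3      /* seconds to run */
#define BLKSZ    4096

static volatile int stop;

struct worker { int id; long ops; };

static void *reader(void *arg)
{
    struct worker *w = arg;
    char path[64], buf[BLKSZ];

    snprintf(path, sizeof(path), "drbl.%d", w->id);  /* private file => low sharing */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    while (!stop) {
        if (pread(fd, buf, BLKSZ, 0) != BLKSZ)       /* re-read the same block */
            break;
        w->ops++;
    }
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    struct worker w[NWORKERS];
    char block[BLKSZ];

    memset(block, 0, sizeof(block));
    for (int i = 0; i < NWORKERS; i++) {             /* pre-create the private files */
        char path[64];
        snprintf(path, sizeof(path), "drbl.%d", i);
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        write(fd, block, BLKSZ);
        close(fd);
    }

    for (int i = 0; i < NWORKERS; i++) {
        w[i].id = i;
        w[i].ops = 0;
        pthread_create(&tid[i], NULL, reader, &w[i]);
    }
    sleep(DURATION);
    stop = 1;

    long total = 0;
    for (int i = 0; i < NWORKERS; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].ops;
    }
    printf("%.1fK ops/sec total\n", total / 1000.0 / DURATION);
    return 0;
}

The medium- and high-sharing variants differ only in what the workers share (roughly, different blocks of one shared file vs. the same block of one shared file).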

Stress different components with various sharing levels

12

Evaluation

● Data block read, low sharing level

[Plot: M ops/sec vs. #core (0-80); legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS] → Linear scalability

13

Outline

● Background
● FxMark design
● Analysis of five Linux file systems
  – What are the scalability bottlenecks?
● Pilot solution
● Related work
● Summary

14

Summary of results: file systems are not scalable

[Grid of plots, one per workload, throughput vs. #core (0-80); legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS.
 Micro-benchmarks (M ops/sec): DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM, plus O_DIRECT variants DRBL, DRBM, DWOL, DWOM.
 Applications: Exim (messages/sec), RocksDB (ops/sec), DBENCH (GB/sec).]

15


Data block read

● Low sharing (DRBL): all file systems scale linearly
● Medium sharing (DRBM): XFS shows performance collapse
● High sharing (DRBH): all file systems show performance collapse

[Plots: DRBL, DRBM, and DRBH; M ops/sec vs. #core (0-80); R = read operation]

19

Page cache is maintained for efficient access of file data

[Diagram, cache-miss path in the OS kernel: 1. read a file block → 2. look up the page cache → 3. cache miss → 4. read the page from disk → 5. copy the page to the reader]

20

Page cache hit

[Diagram: 1. read a file block → 2. look up the page cache → 3. cache hit → 4. copy the page to the reader]

21
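To make the lookup steps above concrete, here is a toy user-space sketch of a page cache keyed by (file descriptor, block index). It is not the kernel's implementation, which uses a per-inode radix tree/xarray with locking and eviction; all names and sizes here are illustrative assumptions.

/*
 * Toy sketch of the page-cache lookup described above; NOT kernel code.
 * A tiny hash table keyed by (fd, block index) stands in for the real
 * per-inode structure, and nothing is ever evicted.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define NBUCKETS  1024

struct page {
    int fd;
    long index;                 /* block number within the file */
    char data[PAGE_SIZE];
    struct page *next;
};

static struct page *buckets[NBUCKETS];

static unsigned bucket_of(int fd, long index)
{
    return ((unsigned)fd * 31u + (unsigned)index) % NBUCKETS;
}

/* 2. look up the page cache; on 3. a miss, 4. read the page from "disk". */
static struct page *find_or_read_page(int fd, long index)
{
    unsigned h = bucket_of(fd, index);

    for (struct page *p = buckets[h]; p; p = p->next)
        if (p->fd == fd && p->index == index)
            return p;                        /* cache hit */

    struct page *p = calloc(1, sizeof(*p));  /* cache miss: fill a new page */
    p->fd = fd;
    p->index = index;
    pread(fd, p->data, PAGE_SIZE, index * PAGE_SIZE);
    p->next = buckets[h];
    buckets[h] = p;
    return p;
}

/* 1. read a file block; 5. copy the cached page to the caller's buffer. */
static void read_block(int fd, long index, char *out)
{
    struct page *p = find_or_read_page(fd, index);
    memcpy(out, p->data, PAGE_SIZE);
}

int main(void)
{
    char buf[PAGE_SIZE];
    int fd = open("/dev/zero", O_RDONLY);

    read_block(fd, 0, buf);     /* first call misses and reads from "disk" */
    read_block(fd, 0, buf);     /* second call hits the cache */
    close(fd);
    puts("done");
    return 0;
}

A second read of the same block returns straight from the hash table; that cache-hit path is exactly what the high-sharing read results above stress.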

Page cache can be evicted to secure free memory

[Diagram: a page is dropped from the page cache to free memory; its contents remain on disk]

22

… but only while the page is not being accessed

Reference counting is used to track the number of accessing tasks:

access_a_page(...) {
    atomic_inc(&page->_count);
    ...
    atomic_dec(&page->_count);
}

[Diagram: 1. read a file block → look up the page cache → 4. copy the page, with the reference count held for the duration of the access]

23

Reference counting becomes a scalability bottleneck

access_a_page(...) {
    atomic_inc(&page->_count);
    ...
    atomic_dec(&page->_count);
}

With many concurrent readers (R R R R R R ...), the atomic increment/decrement runs at ~100 CPI (cycles per instruction) instead of ~20 CPI.

[Plot: DRBH, M ops/sec vs. #core (0-80)]

24

Reference counting becomes a scalability bottleneck (cont.)

High contention on a page reference counter → huge memory stall

Many more examples: directory entry cache, XFS inode, etc.

25
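The effect can be reproduced in user space. The sketch below is not kernel code and not the kernel's remedy: every thread increments and decrements either one shared atomic counter (the shape of page->_count under high sharing) or its own cache-line-padded counter (the general shape of a per-core or sloppy counter). Thread and iteration counts are illustrative assumptions.

/*
 * User-space sketch of the contention pattern above; NOT kernel code.
 * mode 0: every thread does atomic inc/dec on ONE shared counter.
 * mode 1: each thread updates its own cache-line-padded counter.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8
#define ITERS    5000000L

static atomic_long shared_count;                     /* mode 0: one hot cache line */

struct padded { atomic_long count; char pad[64 - sizeof(atomic_long)]; };
static struct padded per_thread[NTHREADS];           /* mode 1: one line per thread */

static int mode;

static void *worker(void *arg)
{
    long id = (long)arg;
    atomic_long *c = mode ? &per_thread[id].count : &shared_count;

    for (long i = 0; i < ITERS; i++) {
        atomic_fetch_add(c, 1);                      /* "get" a reference */
        atomic_fetch_sub(c, 1);                      /* "put" the reference */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tid[NTHREADS];

    mode = (argc > 1) ? atoi(argv[1]) : 0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("mode %d done\n", mode);
    return 0;
}

Compile with -pthread and time the two modes; the shared-counter mode typically slows down sharply as threads are added, while the padded per-thread counters do not.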

Lessons learned

High locality can cause performance collapse

Cache hits should be scalable
→ When cache hits are dominant, the scalability of the cache-hit path is what matters.

26

Data block overwrite

● Low sharing (DWOL): ext4, F2FS, and btrfs show performance collapse
● Medium sharing (DWOM): all file systems degrade gradually

[Plots: DWOL and DWOM, M ops/sec vs. #core (0-80); W = write operation]

Btrfs is a copy-on-write (CoW) file system

● Directs a write to a block to a new copy of the block
  → never overwrites the block in place
  → maintains multiple versions of the file system image

[Diagram: the file system image at time T is preserved; a write (W) produces a new version at time T+1]

28

CoW triggers disk block allocation for every write

[Diagram: each write (W) from time T to T+1 goes through block allocation]

→ Disk block allocation becomes a bottleneck
→ Likewise: ext4 journaling, F2FS checkpointing

29
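As a rough illustration of why every CoW overwrite lands in the allocator, here is a toy sketch (not btrfs code) of an in-place overwrite versus a CoW overwrite against a shared bump allocator. The block sizes, mapping structure, and single allocator lock are illustrative assumptions.

/*
 * Toy sketch of why a CoW overwrite hits the block allocator; NOT btrfs code.
 * A "file" maps logical blocks to slots in a block array.  In-place overwrite
 * writes to the mapped slot; CoW overwrite must take a fresh slot from a
 * shared allocator on every write, so concurrent writers all meet at its lock.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ   4096
#define NBLOCKS 1024

static char disk[NBLOCKS][BLKSZ];                    /* the "storage medium" */
static int  next_free = 1;                           /* bump allocator cursor */
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

struct file { int map[16]; };                        /* logical block -> disk slot */

static int alloc_block(void)
{
    pthread_mutex_lock(&alloc_lock);                 /* shared: every CoW write lands here */
    int b = next_free++;
    pthread_mutex_unlock(&alloc_lock);
    return b;
}

/* ext4/XFS-style overwrite: reuse the block that is already mapped. */
static void overwrite_in_place(struct file *f, int lblk, const char *buf)
{
    memcpy(disk[f->map[lblk]], buf, BLKSZ);
}

/* btrfs-style overwrite: never touch the old block, allocate a new one. */
static void overwrite_cow(struct file *f, int lblk, const char *buf)
{
    int newb = alloc_block();                        /* block allocation on every write */
    memcpy(disk[newb], buf, BLKSZ);
    f->map[lblk] = newb;                             /* old block survives as the old version */
}

int main(void)
{
    struct file f = { .map = { 0 } };
    char buf[BLKSZ] = "new data";

    overwrite_in_place(&f, 0, buf);                  /* logical block 0 stays at slot 0 */
    overwrite_cow(&f, 0, buf);                       /* logical block 0 moves to a new slot */
    printf("logical block 0 now maps to disk slot %d\n", f.map[0]);
    return 0;
}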

Lessons learned

Overwriting can be as expensive as appending
→ Critical for log-structured FS (F2FS) and CoW FS (btrfs)

Consistency guarantee mechanisms should be scalable
→ Scalable journaling
→ Scalable CoW index structures
→ Parallel log-structured writing


Entire file is locked regardless of update range

● All tested file systems hold an inode mutex for write operations
  – Range-based locking is not implemented

***_file_write_iter(...) {
    mutex_lock(&inode->i_mutex);
    ...
    mutex_unlock(&inode->i_mutex);
}

Lessons learned

A file cannot be concurrently updated
  – Critical for VMs and DBMSs, which manage large files

Need to consider techniques used in parallel file systems
→ e.g., range-based locking (see the sketch below)

33
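As a hedged sketch of the range-based locking idea (not any file system's actual implementation), the code below splits a file's offset space into fixed lock stripes so that writers to disjoint ranges can proceed concurrently, unlike the single i_mutex shown earlier. The stripe size and the 64 MiB file-size limit are illustrative assumptions.

/*
 * Sketch of range-based write locking instead of a single per-inode mutex;
 * NOT any file system's actual code.  A writer takes only the stripes its
 * [off, off+len) range covers, in ascending order, so writers to disjoint
 * ranges never block each other.
 */
#include <pthread.h>
#include <stdio.h>

#define STRIPE_SIZE (1L << 20)                 /* 1 MiB of file offset per lock */
#define NSTRIPES    64                         /* covers files up to 64 MiB here */

struct range_lock { pthread_mutex_t stripe[NSTRIPES]; };

static void range_lock_init(struct range_lock *rl)
{
    for (int i = 0; i < NSTRIPES; i++)
        pthread_mutex_init(&rl->stripe[i], NULL);
}

static void range_lock(struct range_lock *rl, long off, long len)
{
    int first = off / STRIPE_SIZE;
    int last  = (off + len - 1) / STRIPE_SIZE;

    for (int i = first; i <= last; i++)        /* ascending order => no deadlock */
        pthread_mutex_lock(&rl->stripe[i]);
}

static void range_unlock(struct range_lock *rl, long off, long len)
{
    int first = off / STRIPE_SIZE;
    int last  = (off + len - 1) / STRIPE_SIZE;

    for (int i = last; i >= first; i--)
        pthread_mutex_unlock(&rl->stripe[i]);
}

/* A write path shaped like ***_file_write_iter above, but locking only the
 * byte range being updated instead of the whole inode. */
static void file_write_range(struct range_lock *rl, long off, long len)
{
    range_lock(rl, off, len);
    /* ... update the data blocks covering [off, off+len) ... */
    range_unlock(rl, off, len);
}

int main(void)
{
    struct range_lock rl;

    range_lock_init(&rl);
    file_write_range(&rl, 0, 4096);            /* these two writes touch different   */
    file_write_range(&rl, 8L << 20, 4096);     /* stripes and could run concurrently */
    printf("disjoint ranges use disjoint locks\n");
    return 0;
}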

Summary of findings

● High locality can cause performance collapse
● Overwriting could be as expensive as appending
● A file cannot be concurrently updated
● All directory operations are sequential
● Renaming is system-wide sequential
● Metadata changes are not scalable
● Non-scalability often means wasting CPU cycles
● Scalability is not portable

See our paper

34

Many of these findings are unexpected and counter-intuitive
→ Contention arises at the file system level to maintain data dependencies

35

Outline

● Background
● FxMark design
● Analysis of five Linux file systems
● Pilot solution
  – If we remove contention in a file system, is the file system then scalable?
● Related work
● Summary

36

RocksDB on a 60-partitioned RAMDISK scales better

[Plots: RocksDB ops/sec vs. #core (0-80) for btrfs, ext4, F2FS, tmpfs, and XFS; left: a single-partitioned RAMDISK, right: a 60-partitioned RAMDISK, reaching up to 2.1x higher throughput.]

** Tested workload: DB_BENCH overwrite **

37

Reduced contention on file systems helps improve performance and scalability

38
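From the application side, the partitioning idea can be sketched as follows; this is not RocksDB's code, and the mount-point layout is hypothetical. Files are hashed across N separately mounted file systems so that no single journal, allocator, or other per-file-system structure is shared by every writer.

/*
 * Application-side sketch of the partitioning idea above; NOT RocksDB code.
 * The mount points /mnt/part00 ... /mnt/part59 are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NPARTS 60                  /* e.g., 60 RAMDISK partitions, one mount each */

static unsigned hash_name(const char *name)
{
    unsigned h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}

/* Open (creating if needed) a data file on the partition its name hashes to. */
static int open_partitioned(const char *name)
{
    char path[256];

    snprintf(path, sizeof(path), "/mnt/part%02u/%s",
             hash_name(name) % NPARTS, name);
    return open(path, O_CREAT | O_RDWR, 0644);
}

int main(void)
{
    int fd = open_partitioned("sstable-000123.sst");

    if (fd < 0)
        perror("open_partitioned");
    else
        close(fd);
    return 0;
}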

But partitioning makes performance worse on HDD

[Plots: RocksDB ops/sec vs. #core (0-20) for btrfs, ext4, F2FS, and XFS; left: a single-partitioned HDD, right: a 60-partitioned HDD, where throughput is up to 2.7x lower.]

** Tested workload: DB_BENCH overwrite **

39

But reduced spatial locality degrades performance
→ Medium-specific characteristics (e.g., spatial locality) should be considered

40

Related work

● Scaling operating systems
  – Mostly use a memory file system to factor out the effect of I/O operations

● Scaling file systems
  – Scalable file system journaling
    ● ScaleFS [MIT:MSThesis'14]
    ● SpanFS [ATC'15]
  – Parallel log-structured writing on NVRAM
    ● NOVA [FAST'16]

41

Summary

● Comprehensive analysis of manycore scalability of five widely-used file systems using FxMark

● Manycore scalability should be of utmost importance in file system design

● New challenges in scalable file system design
  – Minimizing contention, scalable consistency guarantee, spatial locality, etc.

● FxMark is open source
  – https://github.com/sslab-gatech/fxmark