Write Optimization of Log-structured Flash File System for Parallel I/O on Manycore Servers
Chang-Gyu Lee, Hyunki Byun, Sunghyun Noh, Hyeongu Kang, Youngjae Kim
Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
SYSTOR '19
Data-Intensive Applications
Massive amounts of data have been generated in recent years (e.g., by database applications), and the volume is expected to keep growing:
- 2007: 281 EB
- 2010: 1.2 ZB
- 2013: 4.4 ZB
- 2020: ~44 ZB (projected)
This explosion drives growing capacity demands for both storage and memory.
Parallel Writes
Manycore servers couple manycore CPUs with high-performance NVMe SSDs. Applications on such servers issue parallel writes through the OS file system (F2FS).
What are Parallel Writes?
- Shared File Writes (DWOM from FxMark [ATC'16]): multiple processes write to private regions of a single shared file.
- Private File Writes with fsync (DWSL from FxMark [ATC'16]): multiple processes write to private files, then call the fsync system call.
[Diagram: Processes 1..N performing direct I/O writes to a shared file (DWOM), and writing and fsyncing private files (DWSL).]
A minimal user-level sketch of these two patterns is shown below.
* FxMark [ATC'16]: Min et al., "Understanding Manycore Scalability of File Systems", USENIX ATC 2016
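The two patterns can be reproduced with a short user-level program. Below is a minimal sketch, assuming a hypothetical F2FS mount at /mnt/f2fs and 4KB writes; it is not FxMark's actual code, and the direct I/O flag from the diagram is omitted to keep the buffers simple:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 4
#define BLK      4096
#define ITERS    10000

static int shared_fd;                 /* one file shared by all threads (DWOM) */

static void *dwom_worker(void *arg)   /* DWOM: private region of a shared file */
{
    long id = (long)arg;
    char buf[BLK];
    memset(buf, 'a' + id, BLK);
    for (int i = 0; i < ITERS; i++)                    /* non-overlapping      */
        pwrite(shared_fd, buf, BLK, (off_t)id * BLK);  /* offsets: no data     */
    return NULL;                                       /* conflicts            */
}

static void *dwsl_worker(void *arg)   /* DWSL: private file, write + fsync */
{
    long id = (long)arg;
    char path[64], buf[BLK];
    snprintf(path, sizeof(path), "/mnt/f2fs/private.%ld", id);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    memset(buf, 'a' + id, BLK);
    for (int i = 0; i < ITERS; i++) {
        pwrite(fd, buf, BLK, 0);
        fsync(fd);                    /* every write is made durable at once */
    }
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    shared_fd = open("/mnt/f2fs/shared", O_RDWR | O_CREAT, 0644);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, dwom_worker, (void *)i); /* or dwsl_worker */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    close(shared_fd);
    return 0;
}

Launching dwom_worker threads exercises the shared-file pattern; swapping in dwsl_worker exercises the private-file-plus-fsync pattern.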
Preliminary Results
[Figure: two plots of throughput (K IOPS, 0-200) vs. number of cores (1-120), for the DWOM and DWSL workloads.]
In the DWOM workload, performance does not scale with the number of cores.
In the DWSL workload, performance stops scaling beyond 42 cores.
Contents
- Introduction and Motivation
- Background: F2FS
- Research Problem: parallel writes do not scale with the number of cores on manycore servers
- Approaches:
  - Applying range locking
  - NVM Node Logging for file and file system metadata
  - Pin-Point NAT Update to completely eliminate checkpointing
- Evaluation Results
- Conclusion
F2FS: Flash-Friendly File System
F2FS is a log-structured file system designed for NAND flash SSDs. It employs two types of logs to benefit from flash parallelism and to ease garbage collection:
- Data log: directory entries and user data
- Node log: inodes and indirect nodes
The Node Address Table (NAT) translates a node id (NID) to a block address. In memory, the block address in a NAT entry is updated whenever the corresponding node log entry is flushed; the entire NAT is flushed to the storage device only during checkpointing.
[On-disk layout: CP | NAT | SIT | SSA (file system metadata, random writes), followed by the Node Log and Data Log (main log area, sequential writes).]
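As a conceptual sketch (with hypothetical names, not F2FS's real structures), the NAT behaves like an array indexed by NID:

#include <stdint.h>

struct nat_entry {
    uint32_t nid;       /* node id */
    uint32_t blk_addr;  /* block address where this node was last logged */
};

/* After a node is appended to the Node Log, only the in-memory copy of its
 * NAT entry is updated; the on-disk NAT catches up at the next checkpoint. */
static inline void nat_set(struct nat_entry *nat_cache, uint32_t nid,
                           uint32_t new_blk_addr)
{
    nat_cache[nid].blk_addr = new_blk_addr;
}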
Problem (1): Serialized Shared File Writes
In F2FS, all writes to a single file are serialized: each writer must acquire the file's inode lock, so while one process holds the lock, the others are blocked even when they write disjoint regions.
[Diagram: processes writing regions A, B, and C of one file; one writer is granted the inode lock while the others block.]
A user-space analogy is sketched below.
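In this analogy (hypothetical names), the per-inode mutex becomes a single pthread mutex per file:

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

struct file_ctx {
    int fd;
    pthread_mutex_t mtx;   /* stands in for the per-inode mutex */
};

/* Every write to the same file takes the SAME lock, regardless of offset,
 * so writers to disjoint regions still execute one at a time. */
ssize_t locked_write(struct file_ctx *f, const void *buf, size_t len, off_t off)
{
    pthread_mutex_lock(&f->mtx);
    ssize_t ret = pwrite(f->fd, buf, len, off);
    pthread_mutex_unlock(&f->mtx);
    return ret;
}

Because the lock is per file rather than per range, throughput cannot scale with the number of writers.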
Problem (2): fsync Processing in F2FS
[Diagram: on fsync, ① dirty data is flushed from DRAM to the Data Log on the SSD and ② the inode is flushed to the Node Log; the in-DRAM NAT entry is updated to reference the new block address, while the on-SSD NAT still references the old data until the next checkpoint.]
A simplified model of this flow is sketched below.
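A simplified, compilable model of the flow above (hypothetical names and sizes; real F2FS kernel code differs). Block I/O is modeled as copies into 4KB log slots, which is exactly why a tiny metadata change still costs a whole-block write:

#include <stdint.h>
#include <string.h>

#define BLK 4096

static char data_log[64][BLK], node_log[64][BLK];  /* on-"SSD" logs        */
static uint32_t data_tail, node_tail;
static uint32_t nat_cache[64];                     /* in-DRAM NAT: nid->blk */

int fsync_model(uint32_t nid, const char data[BLK], const char inode[BLK])
{
    memcpy(data_log[data_tail++], data, BLK);   /* (1) flush data block      */
    memcpy(node_log[node_tail++], inode, BLK);  /* (2) flush inode, a whole  */
                                                /*     4KB block even if a   */
                                                /*     few bytes changed     */
    nat_cache[nid] = node_tail - 1;             /* NAT updated in DRAM only; */
    return 0;                                   /* durable NAT waits for the */
}                                               /* next checkpoint           */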
Problem (3): I/O Blocking during Checkpointing
[Diagram: the same fsync flow, plus ③ checkpointing, which flushes the entire in-DRAM NAT to the SSD. A checkpoint can take on the order of 60 seconds, and while it runs, F2FS blocks all incoming I/O requests, from the user level down to the file system level.]
Summary
We identified the following causes of bottlenecks in F2FS for parallel writes:
1. Serialization of parallel writes to a single file
2. High latency of the fsync system call
3. I/O blocking caused by F2FS checkpointing
Approach (1): Range Locking
In F2FS, parallel writes to a single file are serialized by the inode mutex lock. We employed a range-based lock instead, so that writes to disjoint ranges of a single file can proceed in parallel.
[Diagram: writers to ranges A (ref=0) and B (ref=0) of one file are granted their range locks concurrently, while a conflicting writer to range C (ref=1) blocks.]
A minimal user-space sketch of such a range lock follows.
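The sketch below keeps an interval list of held ranges guarded by a mutex and a condition variable; the names are hypothetical and this is not the in-kernel implementation:

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>

struct range { off_t start, end; struct range *next; };

struct range_lock {       /* init: { PTHREAD_MUTEX_INITIALIZER,          */
    pthread_mutex_t mtx;  /*         PTHREAD_COND_INITIALIZER, NULL }    */
    pthread_cond_t  cv;
    struct range   *held; /* list of currently locked ranges */
};

static bool overlaps(const struct range *r, off_t s, off_t e)
{
    return r->start <= e && s <= r->end;
}

void range_lock_acquire(struct range_lock *rl, off_t start, off_t end)
{
    pthread_mutex_lock(&rl->mtx);
retry:
    for (struct range *r = rl->held; r; r = r->next)
        if (overlaps(r, start, end)) {        /* conflicting writer: wait, */
            pthread_cond_wait(&rl->cv, &rl->mtx);
            goto retry;                       /* then rescan the whole list */
        }
    struct range *r = malloc(sizeof(*r));
    r->start = start; r->end = end;
    r->next = rl->held; rl->held = r;
    pthread_mutex_unlock(&rl->mtx);
}

void range_lock_release(struct range_lock *rl, off_t start, off_t end)
{
    pthread_mutex_lock(&rl->mtx);
    for (struct range **pp = &rl->held; *pp; pp = &(*pp)->next)
        if ((*pp)->start == start && (*pp)->end == end) {
            struct range *dead = *pp;
            *pp = dead->next;
            free(dead);
            break;
        }
    pthread_cond_broadcast(&rl->cv);   /* wake writers waiting on overlaps */
    pthread_mutex_unlock(&rl->mtx);
}

Writers to non-overlapping ranges acquire their locks concurrently; only true overlaps block.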
Approach (2): Reducing the High Latency of fsync Processing
When fsync is called, F2FS has to flush both data and metadata. Even if only a small portion of the metadata has changed, a whole block has to be flushed, so the latency of fsync is dominated by slow block I/O and suffers write amplification (DRAM to SSD).
To mitigate the high latency of fsync, we propose NVM Node Logging and a fine-grained inode: NVM's byte-addressability gives better latency for small metadata updates (DRAM to NVM).
Approach (2): Node Logging on NVM
[Diagram: the fsync flow from Problem (2), with the Node Log and NAT moved from the SSD to byte-addressable NVM; ① data is still flushed to the Data Log on the SSD, while ② the inode goes to the NVM node log and its NAT entry is updated.]
A minimal sketch of appending to a node log on memory-mapped NVM follows.
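This sketch assumes a DAX-mapped PMEM region and CLWB support (compile with -mclwb); the names are hypothetical, not the paper's code:

#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

/* Write dirty cache lines back to NVM and fence, so the bytes are durable
 * before we return. */
static void persist(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clwb((void *)p);
    _mm_sfence();
}

/* Append one node entry to a node log living on memory-mapped NVM.
 * Byte-addressability means we persist exactly `len` bytes instead of
 * issuing a whole 4KB block write. */
void node_log_append(char *nvm_log, size_t *nvm_tail,
                     const void *node, size_t len)
{
    memcpy(nvm_log + *nvm_tail, node, len);
    persist(nvm_log + *nvm_tail, len);      /* entry first...               */
    *nvm_tail += len;
    persist(nvm_tail, sizeof(*nvm_tail));   /* ...then the tail commits it  */
}

Persisting the tail pointer after the entry gives a simple commit point, so a torn append is ignored after a crash.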
Approach (3): Fine-Grained inode Structure
[Diagram: in baseline F2FS, the inode is a 4KB block holding the NID, direct block-address pointers, and node ids for direct, indirect, and double-indirect nodes that address 4KB data blocks. The fine-grained inode reduces this to about 0.4KB, which byte-addressable NVM can persist directly instead of rewriting a full 4KB block.]
Illustrative struct layouts are sketched below.
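Illustrative layouts follow; the field names and the fine-grained split are hypothetical, while the 923 direct pointers and 5 node ids reflect the stock F2FS on-disk inode:

#include <stdint.h>

/* Baseline F2FS inode: roughly one full 4KB block (other fields omitted).
 * Even a tiny metadata change forces this whole block to be rewritten. */
struct baseline_inode {
    uint32_t footer_nid;      /* node id of this inode               */
    uint32_t i_addr[923];     /* direct data-block pointers          */
    uint32_t i_nid[5];        /* direct/indirect/double-indirect nids */
};

/* Fine-grained inode (hypothetical): only ~0.4KB of hot metadata, small
 * enough for NVM to persist byte-by-byte without block I/O. */
struct fine_grained_inode {
    uint32_t footer_nid;
    uint32_t i_addr[91];      /* a much smaller pointer window       */
    uint32_t i_nid[5];        /* deeper trees still reached via nids */
};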
Approach (4): Pin-Point NAT Update
Frequent fsync calls trigger checkpointing in F2FS, and F2FS blocks all incoming I/O requests during checkpointing. To eliminate checkpointing, we propose Pin-Point NAT Update: when fsync is called, we update only the modified NAT entry directly in NVM, so checkpointing is no longer necessary to persist the entire NAT.
[Diagram: the fsync flow with the Node Log and NAT on NVM; ③ only the NAT entry of the flushed node is updated in place, instead of flushing the whole NAT at a checkpoint.]
A minimal sketch follows.
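A minimal sketch, reusing the CLWB-based persistence idea from the node-logging sketch (hypothetical names; compile with -mclwb):

#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stdint.h>

struct nat_entry { uint32_t nid; uint32_t blk_addr; };

/* On fsync, patch just the one NAT entry of the flushed node, in place on
 * NVM: an 8-byte persist replaces flushing the entire NAT at a checkpoint. */
void pinpoint_nat_update(struct nat_entry *nvm_nat, uint32_t nid,
                         uint32_t new_blk_addr)
{
    nvm_nat[nid].blk_addr = new_blk_addr;
    _mm_clwb(&nvm_nat[nid]);   /* the entry fits in one cache line */
    _mm_sfence();
}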
Evaluation Setup
Microbenchmark (FxMark):
- DWOM: shared file write
- DWSL: private file write with fsync
Test-bed: IBM x3950 X6 manycore server
- CPU: Intel Xeon E7-8870 v2 2.3 GHz, 8 CPU nodes (15 cores per node), 120 cores in total
- RAM: 740 GB
- SSD: Intel SSD 750 Series 400 GB (NVMe); read 2200 MB/s, write 900 MB/s
- NVM: 32 GB, emulated as a PMEM device on RAM
- OS: Linux kernel 4.14.11
* FxMark [ATC'16]: Min et al., "Understanding Manycore Scalability of File Systems", USENIX ATC 2016
Shared File Write (DWOM Workload)
[Figure: throughput (K IOPS, 0-140) vs. number of cores (1-120) for baseline, range lock, node logging, and integrated; speedups of ×15 and ×6.8 are annotated on the plot.]
- The baseline and node logging lines overlap: node logging does not help at all, because the DWOM workload issues no fsync calls.
Frequent fsync (DWSL Workload)
[Figure: throughput (K IOPS, 0-250) vs. number of cores (1-120) for baseline, range lock, node logging, and integrated; a speedup of ×1.6 is annotated on the plot.]
Conclusion
We identified the performance bottlenecks of F2FS for parallel writes:
1. Serialization of shared file writes on a single file
2. High latency of fsync operations in F2FS
3. High I/O blocking times during checkpointing
To solve these problems, we proposed:
1. File-level range locking to allow parallel writes on a shared file
2. NVM Node Logging to provide lower latency for updating file and file system metadata
3. Pin-Point NAT Update to eliminate the I/O blocking of checkpointing
Q&A
Thank you!
Contact: Changgyu Lee (changgyu@sogang.ac.kr)
Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea