Ben Walker, Data Center Group, Intel Corporation
Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2017 Intel Corporation.
What does a filesystem do?
Directories
Permissions
Access Times
Byte Granularity
Checksums
Snapshots
TRIM
Sparse Allocation
Caching
I/O Scheduling
RAID
Partitions
What sort of application benefits from SPDK?
Lots of I/O
Latency Sensitive
SAN? Database? Cache?
We picked two use cases:
RocksDB
Dynamic Block Allocation
RocksDB
Log-structured merge tree
Written in C++, Open Source
Pluggable storage backend
Broadly adopted
Recommends XFS
Makes minimal use of XFS
Directory structure
I/O pattern
Minimal caching needs
No other file system features required!
Glossary Of Terms
File: Array of bytes
Mutable, Resizable
String name
Object: Array of bytes
Immutable, replaceable
String name
Page: 4 KiB
Design Goals
Simple and efficient
Design for fast storage media
Support file & object-like semantics
[Figure: layer stack, top to bottom: BlobFS, Blobstore, BDEV]
Blobstore Basics
The user interacts with chunks of data called blobs
Array of pages
Mutable, resizable
ID
Asynchronous
No blocking, queueing, or waiting
Fully parallel operations
No locks in I/O path
I’m very efficient
Blobstore Space Allocation
[Figure: Cluster 0 contains Page 0 through Page 255, drawn over LBA 0 through LBA 255 on the disk.]
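The sizes implied by the figure can be checked with a little arithmetic. The sketch below assumes the 4 KiB page from the glossary and a fixed cluster of 256 pages, as drawn for Cluster 0; the function names are made up for illustration and are not part of SPDK.

```python
PAGE_SIZE = 4 * 1024       # bytes per page (4 KiB, from the glossary)
PAGES_PER_CLUSTER = 256    # assumption: Cluster 0 spans Page 0 through Page 255


def cluster_size_bytes() -> int:
    """Size of one cluster in bytes under the assumptions above."""
    return PAGE_SIZE * PAGES_PER_CLUSTER


def clusters_needed(num_pages: int) -> int:
    """Whole clusters required to hold num_pages pages (rounds up)."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return (num_pages + PAGES_PER_CLUSTER - 1) // PAGES_PER_CLUSTER


print(cluster_size_bytes())   # 1048576 bytes, i.e. 1 MiB
print(clusters_needed(256))   # 1
print(clusters_needed(257))   # 2
```

Because space is handed out in whole clusters, a blob that grows by a single page past a cluster boundary consumes a further full cluster.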
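Blobstore Design
Blob: array of pages implemented as an ordered list of clusters
[Figure: a blob built from four clusters at list positions 0, 1, 2, and 3, covering page offsets 0-255, 256-511, 512-767, and 768-1023. The clusters labeled in the diagram are Cluster 905, Cluster 52, Cluster 87, and Cluster 455, placed at scattered locations between LBA 0 and LBA N on the disk.]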
Blobstore Sample I/O
Blobs are read/written by specifying a relative page offset and a page count
[Figure: Blob Write (Offset 254, 6 pages) is split at a cluster boundary. Page offsets 254 and 255 lie in Cluster 905 (list position 0) and become Disk Write(Offset 232583, 2 LBAs), covering LBA 232583 and LBA 232584. Page offsets 256 through 259 lie in Cluster 52 (list position 1) and become Disk Write(Offset 13312, 4 LBAs), covering LBA 13312 through LBA 13315.]
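The split shown for Blob Write (Offset 254, 6 pages) can be sketched as follows. This is an illustration under stated assumptions, not SPDK code: clusters are taken to hold 256 pages, the blob's cluster list is read off the diagram as [905, 52], and `split_blob_io` is a hypothetical helper. The translation from a cluster to a disk LBA is left out because the slides do not state the layout rule.

```python
PAGES_PER_CLUSTER = 256  # assumption taken from the space allocation figure


def split_blob_io(clusters, page_offset, page_count):
    """Split a blob-relative page range into per-cluster pieces.

    Returns a list of (cluster_id, first_page_in_cluster, pages) tuples,
    one per cluster the range touches, in blob order.
    """
    if page_offset < 0 or page_count <= 0:
        raise ValueError("page_offset must be >= 0 and page_count > 0")
    pieces = []
    offset, remaining = page_offset, page_count
    while remaining > 0:
        index, within = divmod(offset, PAGES_PER_CLUSTER)
        if index >= len(clusters):
            raise IndexError("I/O extends past the end of the blob")
        # Never cross a cluster boundary within a single piece.
        run = min(remaining, PAGES_PER_CLUSTER - within)
        pieces.append((clusters[index], within, run))
        offset += run
        remaining -= run
    return pieces


blob_clusters = [905, 52]  # list positions 0 and 1, as labeled in the diagram
print(split_blob_io(blob_clusters, 254, 6))
# [(905, 254, 2), (52, 0, 4)]
```

The two pieces have the same lengths as the two disk writes in the figure: 2 pages at the end of the first cluster and 4 pages at the start of the second.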
Blobstore Metadata
Metadata is stored in pages in a reserved region
Metadata pages are not shared between blobs
A blob may have multiple pages of metadata
[Figure: the metadata region on the SSD, showing Page 0 (Blob 1), Page 1 (Blob 2), Page 2 (Blob 3), Page 3 (Blob 1), Page 4 (Blob 4).]
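The ownership rules above (each metadata page belongs to exactly one blob, and a blob may own several pages) can be modeled with a simple list. The sketch uses the five pages from the figure; the list representation and the `pages_by_blob` helper are assumptions made for illustration, not the on-disk format.

```python
from collections import defaultdict


def pages_by_blob(metadata_region):
    """Group metadata page indices by the blob that owns each page.

    metadata_region[i] is the ID of the blob owning metadata page i.
    """
    owners = defaultdict(list)
    for page_index, blob_id in enumerate(metadata_region):
        owners[blob_id].append(page_index)
    return dict(owners)


# Page 0 (Blob 1), Page 1 (Blob 2), Page 2 (Blob 3), Page 3 (Blob 1), Page 4 (Blob 4)
region = [1, 2, 3, 1, 4]
print(pages_by_blob(region))
# {1: [0, 3], 2: [1], 3: [2], 4: [4]}
```

Blob 1 owns two non-adjacent pages, which matches the figure: a blob's metadata pages need not be contiguous.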
Blobstore API
open, close, read, write, sync, resize
Asynchronous, callback-driven
Read/write in units of pages, space allocation in clusters
Data is direct
Metadata is cached
Minimal support for xattrs
Independent of BlobFS
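To make "asynchronous, callback-driven" concrete, here is a toy in-memory model. It is not the SPDK API: `ToyBlobstore`, its method names, the status codes, and the `poll()` loop are all invented for this sketch. Each call returns immediately, and the result is delivered later to a callback when the caller polls for completions.

```python
PAGE_SIZE = 4096  # 4 KiB pages, as in the glossary


class ToyBlobstore:
    """Hypothetical stand-in: every operation completes through a callback."""

    def __init__(self):
        self._blobs = {}     # blob ID -> list of page-sized bytes objects
        self._next_id = 1
        self._pending = []   # completions waiting for the next poll()

    def create(self, cb):
        blob_id = self._next_id
        self._next_id += 1
        self._blobs[blob_id] = []
        self._pending.append(lambda: cb(blob_id, 0))

    def resize(self, blob_id, num_pages, cb):
        pages = self._blobs[blob_id]
        if num_pages < len(pages):
            del pages[num_pages:]
        else:
            pages.extend(bytes(PAGE_SIZE) for _ in range(num_pages - len(pages)))
        self._pending.append(lambda: cb(0))

    def write(self, blob_id, page_offset, data, cb):
        pages = self._blobs[blob_id]
        count, partial = divmod(len(data), PAGE_SIZE)
        if partial or page_offset + count > len(pages):
            status = -1      # I/O must be whole pages inside the blob
        else:
            for i in range(count):
                pages[page_offset + i] = bytes(data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE])
            status = 0
        self._pending.append(lambda: cb(status))

    def read(self, blob_id, page_offset, page_count, cb):
        pages = self._blobs[blob_id]
        if page_offset + page_count > len(pages):
            self._pending.append(lambda: cb(-1, b""))
        else:
            data = b"".join(pages[page_offset:page_offset + page_count])
            self._pending.append(lambda: cb(0, data))

    def poll(self):
        """Run queued completion callbacks; return how many were run."""
        ready, self._pending = self._pending, []
        for complete in ready:
            complete()
        return len(ready)


store = ToyBlobstore()
results = {}

def on_read(status, data):
    results["read"] = (status, data[:5])

def on_write(status):
    store.read(results["id"], 0, 1, on_read)

def on_resize(status):
    store.write(results["id"], 0, b"hello".ljust(PAGE_SIZE, b"\0"), on_write)

def on_create(blob_id, status):
    results["id"] = blob_id
    store.resize(blob_id, 1, on_resize)

store.create(on_create)
while store.poll():
    pass
print(results["read"])  # (0, b'hello')
```

Each step is chained from the previous step's callback, so the caller never blocks; the only place work advances is the polling loop.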
BlobFS Design
Layered on Blobstore
User interacts with files
Data can be cached
Synchronous API*
* Asynchronous API possible
[Figure: Core 0 calls open() and write(), Core 1 calls open() and read(), and Core 2 runs the Async I/O Thread, which sits in front of the I/O Device.]
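One way to picture a synchronous file API layered over a single asynchronous I/O thread is the sketch below. It is an assumption-laden model, not BlobFS code: `SyncFile` is a hypothetical class, a `bytearray` stands in for the blob, and a queue plus `concurrent.futures.Future` stands in for whatever message passing the real implementation uses.

```python
import queue
import threading
from concurrent.futures import Future


class SyncFile:
    """Hypothetical blocking file API; all I/O runs on one dedicated thread."""

    def __init__(self):
        self._data = bytearray()       # stands in for the underlying blob
        self._requests = queue.Queue()
        self._io_thread = threading.Thread(target=self._io_loop, daemon=True)
        self._io_thread.start()

    def _io_loop(self):
        # Only this thread touches self._data, so no lock is needed for it.
        while True:
            item = self._requests.get()
            if item is None:           # shutdown marker sent by close()
                return
            operation, future = item
            try:
                future.set_result(operation())
            except Exception as exc:
                future.set_exception(exc)

    def _submit_and_wait(self, operation):
        future = Future()
        self._requests.put((operation, future))
        return future.result()         # the calling thread blocks here

    def write(self, offset, data):
        def operation():
            end = offset + len(data)
            if end > len(self._data):
                self._data.extend(bytes(end - len(self._data)))
            self._data[offset:end] = data
            return len(data)
        return self._submit_and_wait(operation)

    def read(self, offset, length):
        return self._submit_and_wait(
            lambda: bytes(self._data[offset:offset + length]))

    def close(self):
        self._requests.put(None)
        self._io_thread.join()


f = SyncFile()
written = f.write(0, b"rocksdb")
head = f.read(0, 5)
f.close()
print(written, head)  # 7 b'rocks'
```

The caller sees ordinary blocking calls, while every operation is actually executed on the I/O thread, which mirrors the Core 0 / Core 1 / Core 2 split in the figure.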
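BlobFS Caching
Not a general purpose page cache
Read ahead
Sequential write buffering
All other access bypasses cache
[Figure: the BlobFS design diagram again, with Core 0 calling open() and write(), Core 1 calling open() and read(), Core 2 running the Async I/O Thread, and the I/O Device; additional write(), write(), and read() operations are shown.]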
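The caching rules above can be read as a small decision policy. The sketch below is a toy interpretation, not the BlobFS implementation: `CachePolicy`, its offset tracking, and the returned labels are all assumptions. A write is buffered only if it lands exactly at the current end of sequential writes, a read qualifies for read-ahead only if it continues where the previous read ended, and everything else goes direct.

```python
class CachePolicy:
    """Toy per-file policy: cache only sequential access, bypass the rest."""

    def __init__(self):
        self.append_offset = 0   # where the next sequential write must start
        self.next_read = 0       # where the next sequential read must start

    def on_write(self, offset, length):
        if offset == self.append_offset:
            self.append_offset += length
            return "buffered"
        return "direct"

    def on_read(self, offset, length):
        sequential = offset == self.next_read
        self.next_read = offset + length
        return "read-ahead" if sequential else "direct"


policy = CachePolicy()
print(policy.on_write(0, 10))   # buffered
print(policy.on_write(10, 5))   # buffered
print(policy.on_write(3, 2))    # direct
print(policy.on_read(0, 4))     # read-ahead
print(policy.on_read(100, 4))   # direct
print(policy.on_read(104, 4))   # read-ahead
```

A policy this narrow suits a workload that mostly appends to files and scans them sequentially, which is consistent with the slide's point that this is not a general purpose page cache.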
Benchmark: db_bench Read/Write Latency
[Figure: two bar charts titled "Percentile Latency" comparing Kernel and SPDK. The first shows the 50th and 75th percentiles on a latency (ms) axis from 0 to 2; the second shows the 99th and 99.9th percentiles on a latency (ms) axis from 0 to 20.]
System Configuration: 2x Intel® Xeon® E5-2699v3, Intel® Speed Step enabled, Intel® Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, Fedora* Linux 25, Linux kernel 4.10.8, Intel® P3700 NVMe SSD (800GB), FW 8DV101H0, SPDK 17.03, DPDK 17.02, RocksDB 5.1.2
Benchmark: db_bench Read/Write Throughput
[Figure: bar chart of transactions per second for Kernel and SPDK, on an axis from 0 to 30000.]
System Configuration: 2x Intel® Xeon® E5-2699v3, Intel® Speed Step enabled, Intel® Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, Fedora* Linux 25, Linux kernel 4.10.8, Intel® P3700 NVMe SSD (800GB), FW 8DV101H0, SPDK 17.03, DPDK 17.02, RocksDB 5.1.2
Next Steps
Major API clarifications
More & better benchmarking
Use blobstore as a dynamic partitioner (bdev)
BlobFS caching strategy is RocksDB-centric
Asynchronous BlobFS API
Sparse allocation of blobs
More open source application integration?
SPDK Blobstore Vs. Kernel: Latency
[Figure: bar chart "db_bench 99.99th Percentile Latency" (lower is better) for the Readwrite workload, latency (µs) axis from 0 to 140000, comparing Kernel (256KB sync) with Blobstore (20GB Cache + Readahead); annotated 372%.]
SPDK Blobstore reduces tail latency by 3.7X
[Figure: bar chart "db_bench 99.99th Percentile Latency" (lower is better) for the Insert, Randread, and Overwrite workloads, latency (µs) axis from 0 to 7000, comparing Kernel (256KB sync) with Blobstore (20GB Cache + Readahead); annotated 21%, 44%, and 28%.]
System Configuration: 2x Intel® Xeon® E5-2699v3, Intel® Speed Step enabled, Intel® Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, Fedora* Linux 25, Linux kernel 4.10.8, Intel® P3700 NVMe SSD (800GB), FW 8DV101H0, SPDK 17.03, DPDK 17.03, RocksDB 5.1.2
SPDK Blobstore Vs. Kernel: Transactions Per Second
[Figure: bar chart "db_bench Key Transactions" (higher is better) for the Insert, Randread, Overwrite, and Readwrite workloads, keys per second axis from 0 to 1200000; annotated 85%, 8%, 4%, and ~0%.]
System Configuration: 2x Intel® Xeon® E5-2699v3, Intel® Speed Step enabled, Intel® Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, Fedora* Linux 25, Linux kernel 4.10.8, Intel® P3700 NVMe SSD (800GB), FW 8DV101H0, SPDK 17.03, DPDK 17.03, RocksDB 5.1.2