Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Post on 22-Jan-2018


KernelTLV Filesystem Supersession

Part I: File Systems: why, how and where (another slide deck) - Philip.Derbeko@gmail.com

Part II: Emerging PM HW & SW stack implications - Amit.Golander@netapp.com

Part III: ZUFS – PM-based file systems in user space - Shachar.Sharon@netapp.com, Boaz.Harrosh@netapp.com

© 2017 NetApp, Inc. All rights reserved.

KernelTLV Meetup, Nov. 14th 2017

Part II: Persistent Memory (PM) – HW & SW implications

Emerging PM/NVDIMM devices, the value they bring to applications and how they revolutionize the storage stack


About

• ~12,000 employees
• 50+ countries
• Only top-5 vendor that is rapidly growing
• Celebrating 25 years (founded in 1992)
• NetApp acquired @ June 2017
• TLV area based
• Recruiting FS dev. (+1)
• PM software pioneer since 2014
• Ground-breaking latencies

Storage Media Generations

[Figure: storage media generations: HDD, FLASH, NVDIMM / PM. Annotations: IOPS (even if random…), Latency (even under load…)]

PM marries the best of both worlds: memory speed + storage persistency

Definitions

SCM (Storage Class Memory): byte-addressable media @ near-memory speed

PM (Persistent Memory): byte-addressable device @ near-memory speed, on the memory bus

Latency: <1us (rounded latency numbers, under typical load)

PM-based Storage - Question Traditional Assumptions

Byte-addressable media, block-addressable wrapper?

Assumptions to question: SW layers, network, SW caching, block abstraction


PM-based Software Approaches

[Diagram: software approaches between the App, the SW infrastructure and the HW, ordered from SW reuse to performance (memory vs. storage)]

• Block wrapper: Application, Block-based FS, Page Cache, bio
• PM-based FS / DAX-enabled FS: Application
• NPM: re-written Application

Linux Kernel Enablers

• DAX-enabled FS: mount option “-o dax”
• BTT (Block, Atomic): built-in kernel driver nd_btt.ko. Source: drivers/nvdimm/btt.c
• PMEM (Direct Access): built-in kernel driver nd_pmem.ko. Source: drivers/nvdimm/pmem.c
• NFIT Core: built-in kernel driver core.ko. Source: drivers/nvdimm/core.c
• Linux 4.1+ subsystems added support for NVDIMMs. Mostly stable from 4.4
• NVDIMM modules are presented as devices: /dev/pmem0, /dev/pmem1
• QEMU support

Can also refer to the kTLV Meetup from 2016 - https://www.youtube.com/watch?v=FVrgt9JtcwQ

PM-based Software Approaches

[Diagram: Applications reach the NVDIMM Driver through one of three stacks]

• Block wrapper: Block-based FS, Page Cache, bio
• PM-based FS / DAX-enabled FS
• NPM

Storage semantics: read/write, fsync

Memory semantics: mmap, ld/st, msync
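The two access styles named on this slide, storage semantics (read/write, fsync) and memory semantics (mmap, ld/st, msync), can be contrasted with a minimal Python sketch. It is an illustration only and not taken from the talk: the helper names are hypothetical, and it runs against an ordinary temporary file, so on a regular file system both paths still go through the page cache. On a file system mounted with DAX, the mapped stores would be expected to reach the PM directly.

```python
import mmap
import os
import tempfile


def write_storage_semantics(path, offset, data):
    """Storage semantics: an explicit write system call, then fsync()."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, data, offset)   # copy the data through the I/O path
        os.fsync(fd)                  # ask for durability
    finally:
        os.close(fd)


def write_memory_semantics(path, offset, data):
    """Memory semantics: map the file, store into the mapping, then msync."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, 0) as mapping:                 # shared, read/write mapping
            mapping[offset:offset + len(data)] = data      # plain stores into memory
            mapping.flush()                                # msync() on Unix
    finally:
        os.close(fd)


def demo():
    """Write through both paths into one page-sized scratch file and read it back."""
    with tempfile.NamedTemporaryFile(delete=False) as scratch:
        scratch.write(b"\0" * mmap.PAGESIZE)
        path = scratch.name
    try:
        write_storage_semantics(path, 0, b"storage")
        write_memory_semantics(path, 16, b"memory")
        with open(path, "rb") as check:
            content = check.read()
        return content[0:7], content[16:22]
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

The memory-semantics path issues no per-write system call once the mapping exists, which is the property a PM-based stack tries to exploit.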

PM-based Software Approaches: Examples

[Diagram: the three stacks between Applications and the NVDIMM Driver (Block wrapper, PM-based FS / DAX-enabled FS, NPM), annotated with examples]

• DAX-enabled FS: NTFS-DAX, REFS-DAX, XFS-DAX, EXT4-DAX
• PM-based FS: NOVA, LUMFS, SIMFS, HINFS, Plexistor M1FS
• NVDIMM Driver: Windows Server 2016, Linux 4.4 and above, Ubuntu 16.04 LTS, RHEL 7.3, Fedora 23, SLES 12 SP2
• NPM: NVML 1.3

(*) Huge variance in features and stability

(*) Good portability

Part III: ZUFS - Zero-copy User-mode FS

New-style user-mode filesystems that require:
• Extremely low latency
• Synchronous & DAX


From VFS to Zufs


Why Userspace?

• Resiliency

• Ease of development

• Externals (e.g. compress, encrypt)

• Licensing

• Market requirements (avoid kernel modules)


ZUFS and FUSE are complementary tools


Motivation

FUSE is great for HDDs and OK(ish) for SSDs, but not suitable for PMEM

[Figure: media placed by Latency and $/GB: HDD, Flash, SCM, Memory. Software labels: TCP, RDMA, FUSE, ?]

Design Assumptions: FUSE vs. ZUFS

Typical media
  FUSE: Built for HDDs & extended to Flash
  ZUFS: Built for PM/NVDIMMs and DRAM

SW perf. goals
  FUSE: Secondary (high-latency media); async I/O throughput
  ZUFS: SW is the bottleneck; latency is everything

SW caching
  FUSE: Slow media -> rely on OS Page Cache
  ZUFS: Near-memory speed media -> bypass OS Page Cache

Access method
  FUSE: I/O only
  ZUFS: I/O and mmap (DAX)

Cost of redundant copy / context switch
  FUSE: Negligible
  ZUFS: The bottleneck -> avoid copies, queues & remain on core

Latency penalty under load
  FUSE: 100s of µs
  ZUFS: 3-4 µs
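The "Latency penalty under load" row can be turned into a rough calculation of how much of each access is spent in software. The sketch below is illustrative only: the function name is hypothetical, the HDD figure of about 5 ms per random access is an assumed round number, the PM figure takes the upper end of the "<1us" quoted earlier, and the penalties take the low end of "100s of µs" and the middle of "3-4 µs".

```python
def software_share(media_latency_us, software_penalty_us):
    """Fraction of end-to-end latency spent in the software stack."""
    return software_penalty_us / (media_latency_us + software_penalty_us)


# Assumed, rounded inputs (not measurements from the talk).
HDD_LATENCY_US = 5000.0     # a few milliseconds per random HDD access
PM_LATENCY_US = 1.0         # upper end of the "<1us" figure
FUSE_PENALTY_US = 100.0     # low end of "100s of µs"
ZUFS_PENALTY_US = 3.5       # middle of "3-4 µs"

cases = [
    ("FUSE on HDD", HDD_LATENCY_US, FUSE_PENALTY_US),
    ("FUSE on PM", PM_LATENCY_US, FUSE_PENALTY_US),
    ("ZUFS on PM", PM_LATENCY_US, ZUFS_PENALTY_US),
]

for name, media_us, penalty_us in cases:
    share = software_share(media_us, penalty_us)
    print(f"{name}: software accounts for {share:.1%} of the latency")
```

Under these assumptions the same penalty that is a rounding error on an HDD dominates the access time on PM, which matches the table's claim that on PM the software is the bottleneck. Even the 3-4 µs figure is still several times the media latency.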

Zufs Overview

[Diagram: per-core view, Core 1 to Core 4]

Kernel to Userspace


ZUFS – Kernel Zoom-in


Preliminary Results FUSE Vs. ZUFS (PM Media)


• Measured on a dual-socket Intel Xeon 2650v4 (48 HW threads), DRAM-backed PMEM type

• Random 4KB DirectIO write(ish) access

Architecture

[Diagram: APP pages (P P P) are mapped into the Server VM through the zt-vma and unmapped on return. ZUS (Zu Server) has one zt per CPU; ZUF (Zu Feeder) sits in the kernel.]

ZT - ZUFS Thread, one per CPU, with affinity to a single CPU (thread_fifo/rr)

Special ZUFS communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)

ZT-vma - a 4M vma, mmapped per ZT, used as a zero-copy communication area

IOCTL_ZU_WAIT_OPT – the thread sleeps in the kernel, waiting for an operation

On app I/O, the ZT of the current CPU is selected and the app pages are mapped into the ZT-vma. The server thread is released with an operation.

After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT), the app is released, and the server waits for a new operation.

On exit (or server crash) the file is closed and the kernel cleans up all resources.

Async operation is also supported. The server can return EAGAIN.

The server will later complete the operation asynchronously, and the app will be woken up.

Application mmap (DAX) works in the opposite direction: ZUS exposes pages (opt_get_data_block) into the app VM.
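The hand-off described above (a server thread per CPU sleeps until an operation arrives, executes it, then releases the app) can be mimicked in plain user-space Python. This is only a toy model under stated assumptions: the class and function names are hypothetical, a queue stands in for sleeping in IOCTL_ZU_WAIT_OPT, and there is no CPU affinity, page mapping or zero-copy area here.

```python
import queue
import threading


class Operation:
    """One app request handed to a server thread (hypothetical structure)."""

    def __init__(self, payload):
        self.payload = payload
        self.result = None
        self.done = threading.Event()    # the app sleeps on this until completion


class ZtChannel:
    """Stand-in for one per-CPU ZT: a server thread waiting on its own queue."""

    def __init__(self, cpu, handler):
        self.cpu = cpu
        self.ops = queue.Queue()
        self.thread = threading.Thread(target=self._serve, args=(handler,))
        self.thread.start()

    def _serve(self, handler):
        while True:
            op = self.ops.get()          # models sleeping until an operation arrives
            if op is None:               # shutdown marker
                return
            op.result = handler(self.cpu, op.payload)   # server executes the operation
            op.done.set()                # the app is released

    def submit(self, payload):
        op = Operation(payload)
        self.ops.put(op)                 # release the server thread with an operation
        op.done.wait()                   # the app blocks until the server is done
        return op.result

    def close(self):
        self.ops.put(None)
        self.thread.join()


def run_demo(num_cpus=4):
    """Send one request through each per-CPU channel and collect the results."""
    def handler(cpu, payload):
        return f"cpu{cpu}:{payload.upper()}"

    channels = [ZtChannel(cpu, handler) for cpu in range(num_cpus)]
    try:
        # In ZUFS the kernel selects the ZT of the CPU the app runs on;
        # in this model the caller simply picks the channel by index.
        return [channels[cpu].submit(f"write-{cpu}") for cpu in range(num_cpus)]
    finally:
        for channel in channels:
            channel.close()


if __name__ == "__main__":
    print(run_demo())
```

Keeping one channel per CPU means a request never waits behind work queued for another core, which is the intent behind "avoid copies, queues & remain on core".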


Perf. Optimizations - 1 MMAP_LOCAL_CPU


• mm patch to allow single-core TLB invalidate (in the common case)

[Chart: ZUFS w/wo mm patch. X axis: IOPS, 0 to 2,000,000. Y axis: Latency [us], 0 to 30. Series: ZUFS_unpatched_mm, ZUFS_patched_mm.]

Perf. Optimizations - 2


• scheduler patch to allow efficient context switch on same core (Relay Object)

Unimplemented. No perf. results.


Questions