ORION: A Distributed File System for Non-Volatile Main Memories and RDMA-Capable Networks
Jian Yang, Joseph Izraelevitz, Steven Swanson
Non-Volatile Systems Laboratory
Department of Computer Science & Engineering
University of California, San Diego
Accessing NVMM as Remote Storage

Non-Volatile Main Memory (NVMM)
• PCM, STT-RAM, ReRAM, Intel 3D XPoint
• Performant: DRAM-class latency/bandwidth
• Byte-addressable
• Persistent across power failures

Remote Direct Memory Access (RDMA)
• DMA from/to remote memory
• Two-sided verbs (Send/Recv)
• One-sided verbs (Read/Write): bypass the remote CPU (see the read sketch below)
• Byte-addressable

[Diagram: applications on multiple nodes, each with DRAM and NVMM, connected over an RDMA network]
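As background, here is a minimal user-space sketch of posting a one-sided RDMA read with libibverbs. It assumes the queue pair, registered memory regions, and the peer's address/rkey are already set up, and it is only an illustration: Orion itself is a kernel module and uses the kernel verbs API.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Read 'len' bytes from a remote registered buffer into a local one.
     * The caller later polls the completion queue for the result. */
    int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                       uint64_t remote_addr, uint32_t rkey, uint32_t len)
    {
        struct ibv_sge sge;
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        sge.addr   = (uintptr_t)local_buf;
        sge.length = len;
        sge.lkey   = lkey;

        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_RDMA_READ;   /* one-sided: the remote CPU is not involved */
        wr.send_flags = IBV_SEND_SIGNALED;  /* request a local completion entry */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);
    }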
Accessing Local NVMM vs. Remote NVMM
• Access local NVMM: application → NVMM FS (NOVA) → NVMM
• Access remote NVMM over a distributed FS: application → distributed FS → RDMA → remote NVMM
[Charts: latency of write(4KB) + fsync() and throughput of the fileserver workload; lower latency and higher throughput are better. The local NVMM FS (NOVA) finishes the write + fsync in 3.86 µs and sustains 6653 MB/s on fileserver; iSCSI/RDMA and Ceph/RDMA are 19x-241x slower on latency and deliver only 5%-19% of NOVA's throughput.]
Issue #1: Existing Dist. FSs are Slow on NVMM
• Layered design
• Indirection overhead
• Expensive to persist (e.g., fsync())
[Diagram: a file request travels from the application through a user-space file system/FUSE layer to the storage client, crosses the network as an RPC to the storage server, and reaches NVMM through the server's file system as block accesses]
NVMM is Faster than RDMA

                 NVMM                RDMA Network     Hard drive (NVMe)
Latency          300 ns              3 µs             70 µs
Bandwidth        2-16 GB/s           5 GB/s           1.3-3.2 GB/s
Access size      64 B (cache line)   2 KB (MTU)       4 KB (page, @ max BW)

• Networking is faster than (NVMe) storage, but NVMM is faster than networking.
Issue #2: Lack of Support for Local NVMM
• Use case: converged storage
– Local NVMM supports Direct Access (DAX)
• Existing systems do not store data locally
• Running a local FS alongside the distributed FS
– Expensive to move data between the two
[Diagram: the storage client runs a page cache and a local DAX NVMM file system, and reaches remote NVMM on the storage server over RDMA]
ORION: A Distributed File System for NVMM and RDMA-Capable Networks
• A clean-slate design for NVMM and RDMA
• A unified layer: kernel FS + networking
• Pooled NVMM storage
• Accesses metadata and data directly via Direct Access (DAX) and RDMA
• Designed for rack-scale scalability
[Diagram: applications issue POSIX I/O to ORION, which accesses local NVMM via DAX and pooled remote NVMM via RDMA]
Outline
• Background
• Design overview
• Metadata and data management
• Replication
• RDMA persistence
• Evaluation
• Conclusion
ORION: Cluster Overview
• Metadata Server (MDS): runs ORIONFS, keeps authoritative metadata for the whole FS
• Client: runs ORIONFS, keeps active metadata and cached data, and accesses local NVMM
• Data Store (DS): pooled NVMM data
• Metadata access: clients <=> MDS (two-sided)
• Data access: clients => DSs (one-sided)
[Diagram: the MDS holds metadata in NVMM; each client holds metadata and a data cache in DRAM/NVMM; metadata syncs/updates flow between clients and the MDS, clients use DAX reads/writes locally, and RDMA reads/writes reach the data stores]
The ORION File System
• Inherited from NOVA [Xu, FAST '16]
– Per-inode metadata (operation) log
– Build in-DRAM data structures on recovery
– Atomic log append (a minimal sketch follows below)
• Metadata:
– DMA (physical) memory region (MR)
– RDMA-able metadata structures
• Data: globally partitioned
[Diagram: the MDS and each client keep per-inode logs next to VFS inodes and dentries; clients additionally keep a data cache ($); file data is partitioned across data stores ([X], [Y])]
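To make the inherited log mechanism concrete, a minimal sketch of an atomic log append on NVMM follows. The entry layout and helper names are illustrative rather than Orion's actual structures, and it assumes an x86 CPU with the clwb instruction.

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE 64

    struct log_entry {            /* illustrative, roughly 128 bytes */
        uint32_t type;            /* e.g., FileWrite */
        uint32_t size;
        uint64_t addr;            /* (node, block) where the data lives */
        char     payload[112];
    };

    /* Write back the cache lines covering [p, p+len) and fence. */
    void persist(const void *p, size_t len)
    {
        const char *c = (const char *)p;
        for (size_t i = 0; i < len; i += CACHELINE)
            _mm_clwb(c + i);
        _mm_sfence();
    }

    /* Append one entry to a per-inode log. The entry becomes visible only
     * when the 8-byte tail advances, which the CPU updates atomically. */
    void log_append(struct log_entry *log, volatile uint64_t *tail,
                    const struct log_entry *e)
    {
        uint64_t t = *tail;
        memcpy(&log[t], e, sizeof(*e));                /* 1. write past the tail */
        persist(&log[t], sizeof(*e));                  /* 2. make it durable     */
        *tail = t + 1;                                 /* 3. atomic tail update  */
        persist((const void *)tail, sizeof(uint64_t)); /* 4. persist the commit  */
    }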
Operations in ORION File System
• Open(a) – Allocate inode
– Issue an RPC via RDMA_Send (RPC: Open, Path: a, WAddr: &alloc; an illustrative message layout follows below)
– RDMA_Write the inode into the allocated space
– RDMA_Read the rest of the inode log
[Diagram: the RPC lands in the MDS receive queue; inode a then appears in the client's per-inode log area]
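The open RPC above fits in a small RDMA_Send message. A purely illustrative C layout is sketched here; the field names are assumptions, not Orion's wire format, and the WAddr field carries the client-allocated space the inode can be RDMA_Written back into.

    #include <stdint.h>

    /* Illustrative layout of the Open RPC carried in an RDMA_Send. */
    struct open_rpc {
        uint32_t opcode;      /* e.g., a hypothetical ORION_OPEN code */
        uint32_t path_len;    /* length of 'path' below */
        uint64_t write_addr;  /* WAddr: client-allocated inode space that the
                                 reply is RDMA_Written into */
        uint32_t write_rkey;  /* rkey covering that space */
        char     path[];      /* e.g., "a" */
    };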
Operations in ORION File System
• Write(c) – Allocate & CoW to client-owned pages
– Append a log entry locally (Log: FileWrite, Addr=(X,addr), Size=4096)
– Commit the log entry via RDMA_Send (a posting sketch follows below)
– The MDS appends the log entry to its copy of the inode log
– Update tail pointers atomically on both sides
[Diagram: the client copies the data into client-owned pages on data store X, appends the FileWrite entry to its inode log, and the MDS mirrors the entry after receiving the commit]
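A log commit is a small (~128-byte) RDMA_Send that carries the new log entry. Below is a minimal user-space libibverbs sketch; the message layout is assumed, mirroring the log_entry sketch shown earlier.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    struct log_commit_msg {       /* illustrative, 128 bytes total */
        uint32_t type;            /* FileWrite */
        uint32_t size;            /* 4096 */
        uint64_t addr;            /* (X, addr): data store X, block address */
        uint64_t inode;
        char     pad[104];
    };

    /* Post the commit as a two-sided send; a pre-posted receive buffer on
     * the MDS picks it up so the MDS can append it to its inode log. */
    int post_log_commit(struct ibv_qp *qp, struct log_commit_msg *msg,
                        uint32_t lkey)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t)msg, .length = sizeof(*msg), .lkey = lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,        /* two-sided verb */
            .send_flags = IBV_SEND_SIGNALED,  /* a message this small could also be inlined */
        };
        struct ibv_send_wr *bad_wr = NULL;
        return ibv_post_send(qp, &wr, &bad_wr);
    }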
Operations in ORION File System
• Tailcheck(b) – Detect a log commit from another client
– RDMA_Read the remote log tail (an 8-byte one-sided read; sketched below)
– Read the missing entries from the MDS if Len(Local) < Len(Remote)
[Diagram: the client compares its local tail of inode b's log against the MDS's tail and fetches what it is missing]
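A tailcheck only needs the 8-byte remote tail. The sketch below reuses the post_rdma_read() helper from the earlier RDMA example and omits completion polling; the staleness rule is simply the length comparison the slide describes.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                       uint64_t remote_addr, uint32_t rkey, uint32_t len);

    /* Nonzero if the MDS holds log entries this client has not replayed,
     * i.e., another client committed to the same inode. */
    int inode_log_is_stale(struct ibv_qp *qp,
                           uint64_t *remote_tail_buf, uint32_t lkey,
                           uint64_t remote_tail_addr, uint32_t rkey,
                           uint64_t local_tail)
    {
        /* 8-byte one-sided read of the MDS-side tail; the MDS CPU is bypassed. */
        post_rdma_read(qp, remote_tail_buf, lkey,
                       remote_tail_addr, rkey, sizeof(uint64_t));
        /* ... poll the completion queue for the read to finish ... */
        return *remote_tail_buf > local_tail;  /* if stale, RDMA_Read the missing
                                                  entries from the MDS */
    }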
Operations in ORION File System
• Read(b) – Tailcheck (async)
– RDMA_Read the data from the data store (the log entry points at FileWrite, Addr=(Y,3), Size=4096)
– In-place update of the log entry to point at the local copy (Addr=($,0); a small sketch follows below)
• Data locality
– Future reads will hit the DRAM cache
– Future writes will go to local NVMM
[Diagram: the client fetches block (Y,3) from data store Y into its data cache and rewrites the log entry to reference the cached copy ($,0)]
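Once the block sits in the client's cache, the in-place update rewrites only the address field of the existing log entry so later reads resolve locally. The address encoding and the NODE_LOCAL marker below are assumptions for illustration; persist() is the helper from the log-append sketch.

    #include <stdint.h>
    #include <stddef.h>

    #define NODE_LOCAL 0xFFu                       /* assumed marker for "$"      */
    #define MK_ADDR(node, blk) (((uint64_t)(node) << 56) | (uint64_t)(blk))

    struct file_write_entry {                      /* matches the FileWrite shape */
        uint32_t type, size;
        uint64_t addr;                             /* (node, block) of the data   */
    };

    void persist(const void *p, size_t len);       /* from the log-append sketch  */

    /* Point an existing FileWrite entry at the locally cached block and make
     * the 8-byte pointer update durable, e.g., (Y,3) becomes ($,0). */
    void redirect_to_local_copy(struct file_write_entry *e, uint64_t blk)
    {
        e->addr = MK_ADDR(NODE_LOCAL, blk);
        persist(&e->addr, sizeof(e->addr));
    }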
Accelerating Metadata Accesses
• Observations:
– RDMA prefers inbound operations [Su, EuroSys 17]
– RDMA prefers small operations [Kalia, ATC 16]
• MDS request handling:
– Tailcheck (8B RDMA_Read): MDS-bypass
– Log commit (~128B RDMA_Send): single-inode operations
– RPC (varies): other operations, less common
[Chart: latency and throughput of one-sided RDMA reads/writes versus RDMA sends at 8-byte and 512-byte sizes; latencies span 1.3-3.3 µs and throughput spans 1.9-15 Mop/s, with small inbound reads at the favorable end]
Optimizing Log Commits
• Speculative log commit:
– Return when the RDMA_Send verb is signaled (a minimal polling sketch follows below)
– Tailcheck before sending
– Rebuild the inode from its log when necessary
– RPCs for complex operations (e.g., O_APPEND)
• Log commit + persist: ~500 CPU cycles (memcpy, then flush + fence)
[Diagram: two clients send log commits into the MDS receive queue; the MDS appends them to its inode log, tracks the MDS tail versus each client's local tail, and rebuilds the inode from the log when needed]
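"Return when the RDMA_Send verb is signaled" amounts to waiting only for the send's local work completion instead of a reply from the MDS. A minimal user-space polling sketch:

    #include <infiniband/verbs.h>

    /* Spin until the signaled send completes locally; at that point the client
     * treats the log commit as done (speculatively) and continues. */
    int wait_send_signaled(struct ibv_cq *send_cq)
    {
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(send_cq, 1, &wc);
        } while (n == 0);
        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            return -1;      /* fall back to a retry or an RPC path */
        return 0;
    }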
Evaluation
ORION Prototype
• ORION kernel modules (~15K LOC)
• Linux Kernel 4.10
• RDMA Stack: MLNX_OFED 4.3
• Each client bound to 1 core
Networking
• 12 Nodes connected to a switch
• InfiniBand Switch (QLogic 12300)
Hardware
• 2x Intel Westmere-EP CPU
• 16GB DRAM as DRAM
• 32GB DRAM as NVMM
• RNIC: Mellanox ConnectX-2 VPI (40Gbps)
Evaluation: File Operations
[Charts: access latency (µs) for create, mkdir, unlink, rmdir, and FIO 4 KB read/write; lower is better.
Orion vs. distributed file systems (Ceph, Gluster): Orion is far faster; the distributed systems' latencies run into the hundreds of µs (417-932 µs) on metadata operations.
Orion vs. local file systems (NOVA, Ext4-DAX): Orion stays within a small factor (roughly 1.3x-3x) of NOVA on most operations.]
Evaluation: Applications
[Chart: throughput of Filebench workloads (varmail, fileserver, webserver) and MongoDB (YCSB-A), relative to NOVA, for Orion, NOVA, Ext4-DAX, Ceph, and Gluster; higher is better. Orion reaches roughly 68%-85% of NOVA's throughput and far outperforms the distributed file systems.]
Evaluation: Metadata Accesses
[Charts: throughput of metadata operations (tailcheck, log commit, RPC) with 1-8 clients, shown relative to a single client and in absolute Mop/s at 8 clients; higher is better.]
Conclusion
• Existing distributed file systems lack NVMM support and carry significant software overhead
• ORION unifies the NVMM file system and the networking layer
• ORION provides fast metadata accesses
• ORION allows DAX access to local NVMM data
• Performance is comparable to local NVMM file systems