What you need to know about Ceph


Introduction to Ceph, an open-source, massively scalable, software-defined storage system providing object, block, and file storage. This document explains the architecture of Ceph and its integration with OpenStack.


What you need to know about Ceph
Gluster Community Day, 20 May 2014

Haruka Iwao

Index

What is Ceph? Ceph architecture Ceph and OpenStack Wrap-up

What is Ceph?

Ceph

The name "Ceph" is a common nickname given to pet octopuses, short for cephalopod.

Cephalopod?

Ceph is...

Object storage and file system

Open-source, massively scalable, software-defined

History of Ceph

2003: Project born at UCSC

2006: Open-sourced, papers published

2012: Inktank founded, "Argonaut" released

In April 2014: Red Hat acquired Inktank

Yesterday: "Red Hat acquires me"

I joined Red Hat as an architect of storage systems.

This is just a coincidence.

Ceph releases

A major release every 3 months: Argonaut, Bobtail, Cuttlefish, Dumpling, Emperor, Firefly, Giant (coming in July)

Ceph architecture

Ceph at a glance

Layers in Ceph

RADOS is to Ceph FS what /dev/sda is to ext4: RADOS provides the underlying storage layer, and Ceph FS is the file system built on top of it.

RADOS

Reliable: replicated to avoid data loss

Autonomic: OSDs communicate with each other to detect failures; replication is done transparently

Distributed Object Store

RADOS (2)

Fundamentals of Ceph: everything is stored in RADOS, including Ceph FS metadata

Two components: mon and osd

The CRUSH algorithm

OSD

Object storage daemon

One OSD per disk

Uses xfs/btrfs as the backend (btrfs is experimental!)

Write-ahead journal for integrity and performance

3 to 10,000s of OSDs in a cluster
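The state of the OSDs can be inspected with commands like these (illustrative):

  ceph osd stat   # how many OSDs exist, how many are up and in
  ceph osd tree   # the OSD hierarchy (hosts, racks, ...) with up/down status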

OSD (2)

[Diagram: each OSD daemon manages one disk through a local file system such as btrfs, xfs, or ext4]

MON

Monitoring daemon

Maintains the cluster map and cluster state

Deployed in a small, odd number
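Monitor and overall cluster state can be checked roughly like this:

  ceph mon stat   # monitor quorum summary
  ceph -s         # overall cluster status: health, monitors, OSDs, placement groups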

Locating objects

RADOS uses the "CRUSH" algorithm to locate objects

The location is decided through pure "calculation"

No central "metadata" server → no SPoF, massive scalability

CRUSH

1. Assign a placement group: pg = Hash(object name) % num_pg
2. CRUSH(pg, cluster map, rule) → OSDs

Cluster map

Hierarchical OSD map
Replicates across failure domains
Avoids network congestion

Object locations are computed

Example: object "abc" in pool "test"
Hash("abc") % 256 = 0x23 (the pool has 256 PGs)
Pool "test" has ID 3
→ Placement Group: 3.23

PG to OSD

Placement Group: 3.23

CRUSH(PG 3.23, Cluster Map, Rule) → osd.1, osd.5, osd.9

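The same computation can be asked of the cluster itself; the output below is abridged and illustrative:

  ceph osd map test abc
  # ... pool 'test' (3) object 'abc' -> pg 3.xx (3.23) -> up [1,5,9] acting [1,5,9]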

Synchronous Replication

Replication is synchronous to maintain strong consistency
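The replica count is configured per pool, e.g. (pool name illustrative):

  ceph osd pool set test size 3       # keep 3 replicas of every object
  ceph osd pool set test min_size 2   # serve I/O as long as at least 2 replicas are available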

When OSD fails

The OSD is marked "down"; 5 minutes later it is marked "out"

The cluster map is updated

CRUSH(PG 3.23, Cluster Map #1, Rule) → osd.1, osd.5, osd.9

CRUSH(PG 3.23, Cluster Map #2, Rule) → osd.1, osd.3, osd.9
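The 5-minute "down" to "out" delay corresponds to a monitor setting; an OSD can also be marked out by hand (a sketch, with osd.5 as an example):

  # ceph.conf, [mon] section: how long a down OSD waits before being marked out
  mon osd down out interval = 300

  # mark an OSD out manually, triggering re-replication of its placement groups
  ceph osd out osd.5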

Wrap-up: CRUSH

Object name + cluster map → object locations

Deterministic, no metadata at all

Calculation is done on the clients

The cluster map reflects the network hierarchy

The Ceph stack

RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift

RADOSGW

S3 / Swift compatible gateway to RADOS
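A minimal sketch of getting started with RADOSGW, assuming a gateway instance is already configured (user ID and display name below are made up):

  radosgw-admin user create --uid=demo --display-name="Demo User"
  # prints S3-style access and secret keys; point an S3 client such as s3cmd,
  # or a Swift client, at the gateway endpoint and use those credentials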


RBD

RADOS Block Devices

Directly mountable:
  rbd map foo --pool rbd
  mkfs -t ext4 /dev/rbd/rbd/foo

OpenStack integration (Cinder & Glance): explained later
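A fuller, illustrative sequence from image creation to a mounted file system (image name, size, and mount point are examples):

  rbd create foo --size 4096 --pool rbd   # 4096 MB image named "foo"
  rbd map foo --pool rbd                  # exposes /dev/rbd/rbd/foo via the kernel client
  mkfs -t ext4 /dev/rbd/rbd/foo
  mount /dev/rbd/rbd/foo /mnt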


Ceph FS

POSIX-compliant file system built on top of RADOS

Can be mounted with the native Linux kernel driver (cephfs) or FUSE

Metadata servers (mds) manage the metadata of the file system tree
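Mounting Ceph FS, as a sketch (monitor address, credentials, and mount point are placeholders):

  # kernel client
  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
  # or via FUSE
  ceph-fuse /mnt/cephfs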

Ceph FS is reliable

The MDS writes its journal to RADOS, so metadata is not lost when an MDS fails

Multiple MDSes can run for HA and load balancing

Ceph FS and OSD

[Diagram: clients perform data I/O directly against the OSDs; the MDS keeps POSIX metadata (directory, time, owner, etc.) in memory and writes its metadata journal to RADOS]

Dynamic subtree partitioning: the directory tree is divided dynamically among multiple MDSes according to load

Ceph FS is experimental

Other features

Rolling upgrades
Erasure coding
Cache tiering
Key-value OSD backend
Separate backend network

Rolling upgrades

No interruption to the service when upgrading

Stop/start daemons one by one: mon → osd → mds → radosgw
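With the sysvinit scripts of that era this looks roughly like the following, repeated host by host (daemon IDs are examples):

  # after upgrading the packages on a node:
  service ceph restart mon.a
  service ceph restart osd.0
  service ceph restart mds.a
  # finally restart the radosgw instances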

Erasure coding

Use erasure coding instead of replication for data durability

Suitable for rarely modified or rarely accessed objects

Erasure coding vs. replication:

                                          Erasure coding   Replication
  Space overhead (to survive 2 failures)  approx. 40%      200%
  CPU                                     High             Low
  Latency                                 High             Low
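Creating an erasure-coded pool (introduced with Firefly) looks roughly like this; the profile name and the k/m values are examples:

  ceph osd erasure-code-profile set myprofile k=5 m=2   # 5 data + 2 coding chunks: ~40% overhead, survives 2 failures
  ceph osd pool create ecpool 128 128 erasure myprofile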

Cache tiering

[Diagram: a cache tier (e.g. SSD) sits in front of a base tier (e.g. HDD, erasure coded); librados clients read/write the cache tier transparently; on a miss, objects are fetched from the base tier, and data is flushed back to the base tier]
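Setting up a cache tier is roughly (pool names are illustrative):

  ceph osd tier add ecpool cachepool            # attach cachepool in front of ecpool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay ecpool cachepool    # redirect client I/O through the cache tier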

Key-value OSD backend

Use LevelDB as the OSD backend (instead of xfs)

Better performance, especially for small objects

Plans to support RocksDB, NVMKV, etc.
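This is selected through the OSD object-store setting in ceph.conf; the exact value of this experimental option has varied between releases, so treat the line below as an assumption:

  [osd]
  osd objectstore = keyvaluestore-dev   # experimental LevelDB-backed object store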

Separate backend network

[Diagram: clients write to OSDs over the frontend (service) network (1. write); OSDs replicate to each other over the backend network (2. replicate)]
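In ceph.conf this is a pair of network settings (the subnets are examples):

  [global]
  public network  = 10.0.0.0/24   # frontend: client <-> OSD traffic
  cluster network = 10.0.1.0/24   # backend: replication and recovery traffic between OSDs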

OpenStack Integration

OpenStack with Ceph

RADOSGW and Keystone

[Diagram: the Keystone server grants and revokes tokens; a client queries Keystone for a token and then accesses RADOSGW, a RESTful object store, using that token]
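On the RADOSGW side this is a handful of ceph.conf options; the section name and values below are illustrative:

  [client.radosgw.gateway]
  rgw keystone url = http://keystone-host:35357
  rgw keystone admin token = <admin token>
  rgw keystone accepted roles = admin, Member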

Glance Integration

[Diagram: the Glance server stores and downloads images to/from RBD]

/etc/glance/glance-api.conf:
  default_store=rbd
  rbd_store_user=glance
  rbd_store_pool=images

Just 3 lines are needed!
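After that, a normal image upload goes straight into the RBD pool, e.g. (image name and file are examples):

  glance image-create --name cirros --disk-format raw --container-format bare --file cirros.raw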

Cinder/Nova Integration

[Diagram: the Cinder server manages RBD volumes; nova-compute runs VMs through qemu/librbd, so instances can boot from RBD volumes; images are cloned into volumes with copy-on-write]
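The Cinder and Nova sides are configured similarly to Glance; a sketch based on the options of that era (pool names, user, and UUID are placeholders):

  # /etc/cinder/cinder.conf
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  rbd_pool = volumes
  rbd_user = cinder
  rbd_secret_uuid = <libvirt secret uuid>

  # /etc/nova/nova.conf
  libvirt_images_type = rbd
  libvirt_images_rbd_pool = volumes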

Benefits of using Ceph with OpenStack

Unified storage for both images and volumes

Copy-on-write cloning and snapshot support

Native qemu / KVM support for better performance

Wrap-up

Ceph is

Massively scalable storage

A unified architecture for object / block / POSIX FS

OpenStack integration that is ready to use & awesome

Ceph and GlusterFS

                   Ceph                                 GlusterFS
  Distribution     Object based                         File based
  File location    Deterministic algorithm (CRUSH)      Distributed hash table, stored in xattrs
  Replication      Server side                          Client side
  Primary usage    Object / block storage               POSIX-like file system
  Challenge        POSIX file system needs improvement  Object / block storage needs improvement

Further readings

Ceph Documents

https://ceph.com/docs/master/

Well documented.

Sébastien Han

http://www.sebastien-han.fr/blog/

An awesome blog.

CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data

http://ceph.com/papers/weil-crush-sc06.pdf

CRUSH algorithm paper

Ceph: A Scalable, High-Performance Distributed File System

http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf

Ceph paper

Index of Ceph notes (Ceph の覚え書きのインデックス)
http://www.nminoru.jp/~nminoru/unix/ceph/

A well-written introduction in Japanese

One more thing

Calamari will be open sourced

“Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced.”

http://ceph.com/community/red-hat-to-acquire-inktank/

Calamari screens

Thank you!