Block Storage For VMs With Ceph
Transcript of Block Storage For VMs With Ceph
virtual machine block storage with the ceph distributed storage system
sage weil
xen summit, august 28, 2012
outline
why you should care
what is it, what it does
how it works, how you can use it
architecture
objects, recovery
rados block device
integration
path forward
who we are, why we do this
why should you care about another storage system?
requirements, time, cost
requirements
diverse storage needs
object storage
block devices (for VMs) with snapshots, cloning
shared file system with POSIX, coherent caches
structured data... files, block devices, or objects?
scale
terabytes, petabytes, exabytes
heterogeneous hardware
reliability and fault tolerance
time
ease of administration
no manual data migration, load balancing
painless scaling
expansion and contraction
seamless migration
cost
low cost per gigabyte
no vendor lock-in
software solution
runs on commodity hardware
open source
what is ceph?
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
So what *is* Ceph? Ceph is a massively scalable and flexible object store with tightly-integrated applications that provide REST access to objects, a distributed virtual block device, and a parallel filesystem.
open source
LGPLv2
copyleft
free to link to proprietary code
no copyright assignment
no dual licensing
no enterprise-only feature set
active community
commercial support available
distributed storage system
data center (not geo) scale
10s to 10,000s of machines
terabytes to exabytes
fault tolerant
no SPoF
commodity hardware
ethernet, SATA/SAS, HDD/SSD
RAID, SAN probably a waste of time, power, and money
object storage model
pools
1s to 100s
independent namespaces or object collections
replication level, placement policy
objects
trillions
blob of data (bytes to gigabytes)
attributes (e.g., version=12; bytes to kilobytes)
key/value bundle (bytes to gigabytes)
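The object model above — a data blob, attributes, and a key/value bundle, grouped into named pools — can be sketched as a plain data structure. This is an illustrative toy, not Ceph's actual on-disk format; all names here are made up:

```python
class RadosObjectSketch:
    """Toy stand-in for a RADOS object (illustration only)."""
    def __init__(self):
        self.data = b""   # blob of data (bytes to gigabytes)
        self.attrs = {}   # attributes, e.g. {"version": "12"} (bytes to kilobytes)
        self.omap = {}    # key/value bundle (bytes to gigabytes)

# pools are independent namespaces / object collections
pools = {"rbd": {}, "metadata": {}}

obj = RadosObjectSketch()
obj.data = b"hello"
obj.attrs["version"] = "12"
obj.omap["owner"] = "vm-42"
pools["rbd"]["disk.img.0000"] = obj
```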
object storage cluster
conventional client/server model doesn't scale
server(s) become bottlenecks; proxies are inefficient
if storage devices don't coordinate, clients must
ceph-osds are intelligent storage daemons
coordinate with peers
sensible, cluster-aware protocols
sit on local file system
btrfs, xfs, ext4, etc.
leveldb
Let's start with RADOS, Reliable Autonomic Distributed Object Storage. In this example, you've got five disks in a computer. You have initialized each disk with a filesystem (btrfs is the right filesystem to use someday, but until it's stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
Monitors:
Maintain cluster state
Provide consensus for distributed decision-making
Small, odd number
These do not serve stored objects to clients
OSDs:
One per disk or RAID group
At least three in a cluster
Serve stored objects to clients
Intelligently peer to perform replication tasks
Applications wanting to store objects into RADOS interact with the cluster as a single entity.
data distribution
all objects are replicated N times
objects are automatically placed, balanced, migrated in a dynamic cluster
must consider physical infrastructure
ceph-osds on hosts in racks in rows in data centers
three approaches
pick a spot; remember where you put it
pick a spot; write down where you put it
calculate where to put it, where to find it
CRUSH
Pseudo-random placement algorithm
Fast calculation, no lookup
Ensures even distribution
Repeatable, deterministic
Rule-based configuration
specifiable replication
infrastructure topology aware
allows weighting
Stable mapping
Limited data migration
The way CRUSH is configured is somewhat unusual. Instead of defining pools for different data types, workgroups, subnets, or applications, CRUSH is configured with the physical topology of your storage network. You tell it how many buildings, rooms, shelves, racks, and nodes you have, and you tell it how you want data placed. For example, you could tell CRUSH that it's okay to have two replicas in the same building, but not on the same power circuit. You also tell it how many copies to keep.
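CRUSH itself is more sophisticated (it walks the topology tree and respects weights and placement rules), but its core idea — calculate placement from the object name and the cluster map instead of looking it up in a table — can be sketched with rendezvous (highest-random-weight) hashing. This is a toy stand-in, not the real CRUSH algorithm:

```python
import hashlib

def place(obj_name, osds, replicas=3):
    """Deterministically pick `replicas` OSDs for an object by ranking
    every OSD with a hash of (object, osd). No lookup table is needed:
    any client holding the same OSD list computes the same answer."""
    def score(osd):
        h = hashlib.md5(f"{obj_name}:{osd}".encode()).hexdigest()
        return int(h, 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(10)]
primary_set = place("rbd_data.1234.0000", osds)

# Repeatable and deterministic: same inputs always give the same placement.
assert place("rbd_data.1234.0000", osds) == primary_set
```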
distributed object storage
CRUSH tells us where data should go
small osd map records cluster state at a point in time
ceph-osd node status (up/down, weight, IP)
CRUSH function specifying desired data distribution
object storage daemons (RADOS)
store it there
migrate it as the cluster changes
decentralized, distributed approach allows
massive scales (10,000s of servers or more)
efficient data access
the illusion of a single copy with consistent behavior
large clusters aren't static
dynamic cluster
nodes are added, removed; nodes reboot, fail, recover
recovery is the norm
osd maps are versioned
shared via gossip
any map update potentially triggers data migration
ceph-osds monitor peers for failure
new nodes register with monitors
administrator adjusts weights, marks out old hardware, etc.
What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
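That remapping step can be illustrated with the same calculate-don't-look-up idea: when an OSD disappears from the map, everyone recomputes placement, and only objects that had a replica on the failed OSD end up somewhere new. A toy sketch using rendezvous hashing as a stand-in for CRUSH (all names here are hypothetical):

```python
import hashlib

def place(obj, osds, n=2):
    """Rank OSDs by hash(object, osd); take the top n as the replica set."""
    score = lambda osd: int(hashlib.md5(f"{obj}:{osd}".encode()).hexdigest(), 16)
    return sorted(osds, key=score, reverse=True)[:n]

osds = [f"osd.{i}" for i in range(8)]
objects = [f"obj-{i}" for i in range(1000)]

before = {o: place(o, osds) for o in objects}
surviving = [osd for osd in osds if osd != "osd.3"]   # osd.3 fails
after = {o: place(o, surviving) for o in objects}

# Only objects that had a replica on the failed OSD get new placements;
# everything else stays exactly where it was (limited data migration).
moved = [o for o in objects if before[o] != after[o]]
assert all("osd.3" in before[o] for o in moved)
```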
Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
what does this mean for my cloud?
virtual disks
reliable
accessible from many hosts
appliances
great for small clouds
not viable for public or (large) private clouds
avoid single server bottlenecks
efficient management
The RADOS Block Device (RBD) allows users to store virtual disks inside RADOS. For example, you can use a virtualization container like KVM or QEMU to boot virtual machines from images that have been stored in RADOS. Images are striped across the entire cluster, which allows for simultaneous read access from different cluster nodes.
Separating a virtual computer from its storage also lets you do really neat things, like migrate a virtual machine from one server to another without rebooting it.
[diagram: a host mounting an RBD image directly via KRBD (kernel module)]
As an alternative, machines (even those running on bare metal) can mount an RBD image using native Linux kernel drivers.
RBD: RADOS Block Device
Replicated, reliable, high-performance virtual disk
Allows decoupling of VMs and containers
Live migration!
Images are striped across the cluster
Snapshots!
Native support in the Linux kernel
/dev/rbd1
librbd allows easy integration
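Striping is simple arithmetic: an image is cut into fixed-size objects (4 MB is RBD's default object size), so any byte offset in the virtual disk maps to an object index plus an offset within that object. A minimal sketch:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # RBD's default object size (4 MB)

def locate(offset):
    """Map a byte offset in an RBD image to (object index, offset in object)."""
    return offset // OBJECT_SIZE, offset % OBJECT_SIZE

# The first byte of the image lives at the start of object 0...
assert locate(0) == (0, 0)
# ...and byte 10 MB lands 2 MB into the third object (index 2).
assert locate(10 * 1024 * 1024) == (2, 2 * 1024 * 1024)
```

Because consecutive objects land on different OSDs (via CRUSH), reads and writes to a single image are spread across the whole cluster.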
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
[diagram: one image of size 144 plus four instant copies of size 0; total = 144]
instant copy
With Ceph, copying an RBD image four times gives you five total copies, but only takes the space of one. It also happens instantly.
[diagram: a client writes 4 to its own copy; 144 + 4 = 148 total]
When clients mount one of the copied images and begin writing, they write to their copy.
[diagram: client reads fall through to the original image; total stays 148]
When they read, though, they read through to the original copy if there's no newer data.
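That write-to-your-copy, read-through-to-the-parent behavior is classic copy-on-write, and it is easy to sketch. This is a toy model of the idea, not librbd's implementation:

```python
class CowCloneSketch:
    """Toy copy-on-write clone: writes land in a private overlay,
    reads fall through to the parent when the overlay has no newer data."""
    def __init__(self, parent):
        self.parent = parent   # shared, read-only original (block -> bytes)
        self.overlay = {}      # this clone's private writes

    def write(self, block, data):
        self.overlay[block] = data        # only the clone grows

    def read(self, block):
        if block in self.overlay:         # newer data in the copy?
            return self.overlay[block]
        return self.parent.get(block)     # read through to the original

golden = {0: b"boot", 1: b"rootfs"}       # the master image
clone = CowCloneSketch(golden)

assert clone.read(1) == b"rootfs"         # unmodified block: read-through
clone.write(1, b"patched")
assert clone.read(1) == b"patched"        # modified block: served locally
assert golden[1] == b"rootfs"             # the original is untouched
```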
current RBD integration
native Linux kernel support
/dev/rbd0, /dev/rbd/<pool>/<image>
librbd
user-level library
QEMU/KVM
links to librbd user-level library
libvirt
librbd-based storage pool
understands RBD images
can only start KVM VMs... :-(
CloudStack, OpenStack
what about Xen?
Linux kernel driver (i.e., /dev/rbd0)
easy fit into existing stacks
works today
need recent Linux kernel for dom0
blktap
generic kernel driver, userland process
easy integration with librbd
more featureful (cloning, caching), maybe faster
doesn't exist yet!
rbd-fuse
coming soon!
libvirt
CloudStack, OpenStack
libvirt understands rbd images, storage pools
xml specifies cluster, pool, image name, auth
currently only usable with KVM
could configure /dev/rbd devices for VMs
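The libvirt XML for an RBD-backed disk looks roughly like this. All host names, pool/image names, and the secret UUID below are placeholders, and the exact schema depends on your libvirt version:

```xml
<disk type='network' device='disk'>
  <driver name='qemu'/>
  <source protocol='rbd' name='mypool/myimage'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='...'/>
  </auth>
  <target dev='vda' bus='virtio'/>
</disk>
```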
librbd
management
create, destroy, list, describe images
resize, snapshot, clone
I/O
open, read, write, discard, close
C, C++, Python bindings
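With the Python bindings, the management and I/O calls above look roughly like this. This is a sketch: it assumes a reachable Ceph cluster plus the python-rados/python-rbd packages that ship with Ceph, and the function and argument names local to this example are made up:

```python
def create_and_use_image(conffile, pool_name, image_name, size_bytes):
    """Sketch of the librbd lifecycle: create, open, write, read, close.
    Imports are deferred so this module can be loaded (and read)
    on machines without Ceph installed."""
    import rados, rbd  # python-rados / python-rbd, shipped with Ceph

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool_name)
        try:
            rbd.RBD().create(ioctx, image_name, size_bytes)  # management
            image = rbd.Image(ioctx, image_name)             # I/O: open
            try:
                image.write(b"boot sector", 0)               # write at offset 0
                data = image.read(0, 11)                     # read it back
            finally:
                image.close()
            return data
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```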
RBD roadmap
locking
fence failed VM hosts
clone performance
KSM (kernel same-page merging) hints
caching
improved librbd caching
kernel RBD + bcache to local SSD/disk
why
limited options for scalable open source storage
proprietary solutions
marry hardware and software
expensive
don't scale (out)
industry needs to change
who we are
Ceph created at UC Santa Cruz (2007)
supported by DreamHost (2008-2011)
Inktank (2012)
growing user and developer community
we are hiring
C/C++/Python developers
sysadmins, testing engineers
Los Angeles, San Francisco, Sunnyvale, remote
http://ceph.com/