Block Storage For VMs With Ceph
Transcript of Block Storage For VMs With Ceph
virtual machine block storage with the ceph distributed storage system
sage weil
xen summit, august 28, 2012
outline
why you should care
what is it, what it does
how it works, how you can use it
architecture
objects, recovery
rados block device
integration
path forward
who we are, why we do this
why should you care about another storage system?
requirements, time, cost
requirements
diverse storage needs
object storage
block devices (for VMs) with snapshots, cloning
shared file system with POSIX, coherent caches
structured data... files, block devices, or objects?
scale
terabytes, petabytes, exabytes
heterogeneous hardware
reliability and fault tolerance
time
ease of administration
no manual data migration, load balancing
painless scaling
expansion and contraction
seamless migration
cost
low cost per gigabyte
no vendor lock-in
software solution
runs on commodity hardware
open source
what is ceph?
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
So what *is* Ceph? Ceph is a massively scalable and flexible object store with tightly-integrated applications that provide REST access to objects, a distributed virtual block device, and a parallel filesystem.
open source
LGPLv2
copyleft
free to link to proprietary code
no copyright assignment
no dual licensing
no enterprise-only feature set
active community
commercial support available
distributed storage system
data center (not geo) scale
10s to 10,000s of machines
terabytes to exabytes
fault tolerant
no SPoF
commodity hardware
ethernet, SATA/SAS, HDD/SSD
RAID, SAN probably a waste of time, power, and money
object storage model
pools
1s to 100s
independent namespaces or object collections
replication level, placement policy
objects
trillions
blob of data (bytes to gigabytes)
attributes (e.g., version=12; bytes to kilobytes)
key/value bundle (bytes to gigabytes)
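The object model above — a data blob, attributes, and a key/value bundle, grouped into named pools — can be sketched as a plain data structure. This is an illustrative toy, not Ceph's actual on-disk format; all names here are made up:

```python
class RadosObjectSketch:
    """Toy stand-in for a RADOS object (illustration only)."""
    def __init__(self):
        self.data = b""   # blob of data (bytes to gigabytes)
        self.attrs = {}   # attributes, e.g. {"version": "12"} (bytes to kilobytes)
        self.omap = {}    # key/value bundle (bytes to gigabytes)

# pools are independent namespaces / object collections
pools = {"rbd": {}, "metadata": {}}

obj = RadosObjectSketch()
obj.data = b"hello"
obj.attrs["version"] = "12"
obj.omap["owner"] = "vm-42"
pools["rbd"]["disk.img.0000"] = obj
```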
object storage cluster
conventional client/server model doesn't scale
server(s) become bottlenecks; proxies are inefficient
if storage devices don't coordinate, clients must
ceph-osds are intelligent storage daemons
coordinate with peers
sensible, cluster-aware protocols
sit on local file system
btrfs, xfs, ext4, etc.
leveldb
Let's start with RADOS, Reliable Autonomic Distributed Object Storage. In this example, you've got five disks in a computer. You have initialized each disk with a filesystem (btrfs is the right filesystem to use someday, but until it's stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
Monitors:
Maintain cluster state
Provide consensus for distributed decision-making
Small, odd number
These do not serve stored objects to clients
OSDs:
One per disk or RAID group
At least three in a cluster
Serve stored objects to clients
Intelligently peer to perform replication tasks
Applications wanting to store objects into RADOS interact with the cluster as a single entity.
data distribution
all objects are replicated N times
objects are automatically placed, balanced, migrated in a dynamic cluster
must consider physical infrastructure
ceph-osds on hosts in racks in rows in data centers
three approaches
pick a spot; remember where you put it
pick a spot; write down where you put it
calculate where to put it, where to find it
CRUSH
Pseudo-random placement algorithm
Fast calculation, no lookup
Ensures even distribution
Repeatable, deterministic
Rule-based configuration
specifiable replication
infrastructure topology aware
allows weighting
Stable mapping
Limited data migration
The way CRUSH is configured is somewhat unusual. Instead of defining pools for different data types, workgroups, subnets, or applications, CRUSH is configured with the physical topology of your storage network. You tell it how many buildings, rooms, shelves, racks, and nodes you have, and you tell it how you want data placed. For example, you could tell CRUSH that it's okay to have two replicas in the same building, but not on the same power circuit. You also tell it how many copies to keep.
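CRUSH itself is more sophisticated (it walks the topology tree and respects weights and placement rules), but its core idea — calculate placement from the object name and the cluster map instead of looking it up in a table — can be sketched with rendezvous (highest-random-weight) hashing. This is a toy stand-in, not the real CRUSH algorithm:

```python
import hashlib

def place(obj_name, osds, replicas=3):
    """Deterministically pick `replicas` OSDs for an object by ranking
    every OSD with a hash of (object, osd). No lookup table is needed:
    any client holding the same OSD list computes the same answer."""
    def score(osd):
        h = hashlib.md5(f"{obj_name}:{osd}".encode()).hexdigest()
        return int(h, 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(10)]
primary_set = place("rbd_data.1234.0000", osds)

# Repeatable and deterministic: same inputs always give the same placement.
assert place("rbd_data.1234.0000", osds) == primary_set
```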
distributed object storage
CRUSH tells us where data should go
small osd map records cluster state at a point in time
ceph-osd node status (up/down, weight, IP)
CRUSH function specifying desired data distribution
object storage daemons (RADOS)
store it there
migrate it as the cluster changes
decentralized, distributed approach allows
massive scales (10,000s of servers or more)
efficient data access
the illusion of a single copy with consistent behavior
large clusters aren't static
dynamic cluster
nodes are added, removed; nodes reboot, fail, recover
recovery is the norm
osd maps are versioned
shared via gossip
any map update potentially triggers data migration
ceph-osds monitor peers for failure
new nodes register with monitors
administrator adjusts weights, marks out old hardware, etc.
What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
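That remapping step can be illustrated with the same calculate-don't-look-up idea: when an OSD disappears from the map, everyone recomputes placement, and only objects that had a replica on the failed OSD end up somewhere new. A toy sketch using rendezvous hashing as a stand-in for CRUSH (all names here are hypothetical):

```python
import hashlib

def place(obj, osds, n=2):
    """Rank OSDs by hash(object, osd); take the top n as the replica set."""
    score = lambda osd: int(hashlib.md5(f"{obj}:{osd}".encode()).hexdigest(), 16)
    return sorted(osds, key=score, reverse=True)[:n]

osds = [f"osd.{i}" for i in range(8)]
objects = [f"obj-{i}" for i in range(1000)]

before = {o: place(o, osds) for o in objects}
surviving = [osd for osd in osds if osd != "osd.3"]   # osd.3 fails
after = {o: place(o, surviving) for o in objects}

# Only objects that had a replica on the failed OSD get new placements;
# everything else stays exactly where it was (limited data migration).
moved = [o for o in objects if before[o] != after[o]]
assert all("osd.3" in before[o] for o in moved)
```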
Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
what does this mean for my cloud?
virtual disks
reliable
accessible from many hosts
appliances
great for small clouds
not viable for public or (large) private clouds
avoid single server bottlenecks
efficient management
The RADOS Block Device (RBD) allows users to store virtual disks inside RADOS. For example, you can use a virtualization container like KVM or QEMU to boot virtual machines from images that have been stored in RADOS. Images are striped across the entire cluster, which allows for simultaneous read access from different cluster nodes.
Separating a virtual computer from its storage also lets you do really neat things, like migrate a virtual machine from one server to another without rebooting it.
[diagram: a host mounting an RBD image directly via KRBD (kernel module)]
As an alternative, machines (even those running on bare metal) can mount an RBD image using native Linux kernel drivers.
RBD: RADOS Block Device
Replicated, reliable, high-performance virtual disk
Allows decoupling of VMs and containers
Live migration!
Images are striped across the cluster
Snapshots!
Native support in the Linux kernel
/dev/rbd1
librbd allows easy integration
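Striping is simple arithmetic: an image is cut into fixed-size objects (4 MB is RBD's default object size), so any byte offset in the virtual disk maps to an object index plus an offset within that object. A minimal sketch:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # RBD's default object size (4 MB)

def locate(offset):
    """Map a byte offset in an RBD image to (object index, offset in object)."""
    return offset // OBJECT_SIZE, offset % OBJECT_SIZE

# The first byte of the image lives at the start of object 0...
assert locate(0) == (0, 0)
# ...and byte 10 MB lands 2 MB into the third object (index 2).
assert locate(10 * 1024 * 1024) == (2, 2 * 1024 * 1024)
```

Because consecutive objects land on different OSDs (via CRUSH), reads and writes to a single image are spread across the whole cluster.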
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
[diagram: one image of size 144 plus four instant copies of size 0; total = 144]
instant copy
With Ceph, copying an RBD image four times gives you five total copies, but only takes the space of one. It also happens instantly.
[diagram: a client writes 4 to its own copy; 144 + 4 = 148 total]
When clients mount one of the copied images and begin writing, they write to their copy.
[diagram: client reads fall through to the original image; total stays 148]
When they read, though, they read through to the original copy if there's no newer data.
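That write-to-your-copy, read-through-to-the-parent behavior is classic copy-on-write, and it is easy to sketch. This is a toy model of the idea, not librbd's implementation:

```python
class CowCloneSketch:
    """Toy copy-on-write clone: writes land in a private overlay,
    reads fall through to the parent when the overlay has no newer data."""
    def __init__(self, parent):
        self.parent = parent   # shared, read-only original (block -> bytes)
        self.overlay = {}      # this clone's private writes

    def write(self, block, data):
        self.overlay[block] = data        # only the clone grows

    def read(self, block):
        if block in self.overlay:         # newer data in the copy?
            return self.overlay[block]
        return self.parent.get(block)     # read through to the original

golden = {0: b"boot", 1: b"rootfs"}       # the master image
clone = CowCloneSketch(golden)

assert clone.read(1) == b"rootfs"         # unmodified block: read-through
clone.write(1, b"patched")
assert clone.read(1) == b"patched"        # modified block: served locally
assert golden[1] == b"rootfs"             # the original is untouched
```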
current RBD integration
native Linux kernel support
/dev/rbd0, /dev/rbd/<pool>/<image>
librbd
user-level library
QEMU/KVM
links to librbd user-level library
libvirt
librbd-based storage pool
understands RBD images
can only start KVM VMs... :-(
CloudStack, OpenStack
what about Xen?
Linux kernel driver (i.e., /dev/rbd0)
easy fit into existing stacks
works today
need recent Linux kernel for dom0
blktap
generic kernel driver, userland process
easy integration with librbd
more featureful (cloning, caching), maybe faster
doesn't exist yet!
rbd-fuse
coming soon!
libvirt
CloudStack, OpenStack
libvirt understands rbd images, storage pools
xml specifies cluster, pool, image name, auth
currently only usable with KVM
could configure /dev/rbd devices for VMs
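The libvirt XML for an RBD-backed disk looks roughly like this. All host names, pool/image names, and the secret UUID below are placeholders, and the exact schema depends on your libvirt version:

```xml
<disk type='network' device='disk'>
  <driver name='qemu'/>
  <source protocol='rbd' name='mypool/myimage'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='...'/>
  </auth>
  <target dev='vda' bus='virtio'/>
</disk>
```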
librbd
management
create, destroy, list, describe images
resize, snapshot, clone
I/O
open, read, write, discard, close
C, C++, Python bindings
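With the Python bindings, the management and I/O calls above look roughly like this. This is a sketch: it assumes a reachable Ceph cluster plus the python-rados/python-rbd packages that ship with Ceph, and the function and argument names local to this example are made up:

```python
def create_and_use_image(conffile, pool_name, image_name, size_bytes):
    """Sketch of the librbd lifecycle: create, open, write, read, close.
    Imports are deferred so this module can be loaded (and read)
    on machines without Ceph installed."""
    import rados, rbd  # python-rados / python-rbd, shipped with Ceph

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool_name)
        try:
            rbd.RBD().create(ioctx, image_name, size_bytes)  # management
            image = rbd.Image(ioctx, image_name)             # I/O: open
            try:
                image.write(b"boot sector", 0)               # write at offset 0
                data = image.read(0, 11)                     # read it back
            finally:
                image.close()
            return data
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```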
RBD roadmap
locking
fence failed VM hosts
clone performance
KSM (kernel same-page merging) hints
caching
improved librbd caching
kernel RBD + bcache to local SSD/disk
why
limited options for scalable open source storage
proprietary solutions
marry hardware and software
expensive
don't scale (out)
industry needs to change
who we are
Ceph created at UC Santa Cruz (2007)
supported by DreamHost (2008-2011)
Inktank (2012)
growing user and developer community
we are hiring
C/C++/Python developers
sysadmins, testing engineers
Los Angeles, San Francisco, Sunnyvale, remote
http://ceph.com/