Ceph in Mirantis OpenStack
Dmitry Borodaenko
Mountain View, 2014
The Plan
1. What is Ceph?
2. What is Mirantis OpenStack?
3. How does Ceph fit into OpenStack?
4. What has Fuel ever done for Ceph?
5. What does it look like?
6. Things we've done
7. Disk partitioning for Ceph OSD
8. Cephx authentication settings
9. Types of VM migrations
10. Live VM migrations with Ceph
11. Things we left undone
12. Diagnostics and troubleshooting
13. Resources
What is Ceph?
Ceph is a free clustered storage platform that provides unified object, block, and file storage.
Object Storage: RADOS objects support snapshotting, replication, and consistency.
Block Storage: RBD block devices are thinly provisioned over RADOS objects and can be accessed by QEMU via the librbd library.
[Diagram: the kernel RBD module and librbd speak the RADOS protocol to OSDs and Monitors]
File Storage: CephFS metadata servers (MDS) provide a POSIX-compliant overlay over RADOS.
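Thin provisioning, snapshots, and copy-on-write clones are all exposed through the rbd command line; a minimal sketch (pool and image names here are placeholders, not from the slides):

rbd create volumes/test --size 10240             # thin-provisioned 10 GB image
rbd snap create volumes/test@base                # point-in-time snapshot
rbd snap protect volumes/test@base               # required before cloning
rbd clone volumes/test@base volumes/test-clone   # copy-on-write clone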
What is Mirantis OpenStack?
OpenStack is an open source cloud computing platform.
[Diagram: Nova provisions VMs; Cinder provides volumes for VMs; Glance provides images for VMs; Swift stores objects and stores images for Glance]
Mirantis ships hardened OpenStack packages and provides the Fuel utility to simplify deployment of OpenStack and Ceph.
Fuel uses Cobbler, MCollective, and Puppet to discover nodes, provision the OS, and set up OpenStack services.
[Diagram: on the Fuel master node, Nailgun serializes facts for Astute, which orchestrates Cobbler and MCollective; Cobbler provisions the target node, and the MCollective agent passes facts to Puppet, which configures and starts the services]
How does Ceph fit into OpenStack?
RBD drivers for OpenStack make libvirt configure the QEMU interface to librbd.
Ceph benefits:
- Multi-node striping and redundancy for block storage (Cinder volumes and Nova ephemeral drives)
- Copy-on-write cloning of images to volumes and instances
- Unified storage pool for all types of storage (object, block, POSIX)
- Live migration of Ceph-backed instances
[Diagram: OpenStack configures libvirt, which drives QEMU; QEMU uses librbd and librados to reach the OSDs and Monitors]
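Under the hood, the cephx key reaches QEMU through a libvirt secret; a minimal sketch of that plumbing (the secret UUID and the client.volumes name are assumptions, not from the slides):

virsh secret-define --file ceph-secret.xml        # XML carrying the chosen UUID, usage type "ceph"
virsh secret-set-value --secret <uuid> \
  --base64 $(ceph auth get-key client.volumes)    # hand the cephx key to libvirt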
Problems: sensitivity to clock drift, multi-site (async replication in Emperor), block storage density (erasure coding in Firefly), Swift API gap (rbd backend for Swift)
What has Fuel ever done for Ceph?
1. Fuel deploys Ceph Monitors and OSDs on dedicated nodes or in combination with OpenStack components.
[Diagram: controllers 1-3 each run ceph-mon alongside the controller role; compute nodes 1..n run nova with the ceph client; storage nodes 1..n run ceph-osd; all nodes are connected over the storage and management networks]
2. Creates partitions for OSDs when nodes are provisioned.
3. Creates separate RADOS pools and sets up Cephx authentication for Cinder, Glance, and Nova.
4. Configures Cinder, Glance, and Nova to use the RBD backend with the right pools and credentials (a configuration sketch follows this list).
5. Deploys RADOS Gateway (an S3 and Swift API frontend to Ceph) behind HAProxy on controller nodes.
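A rough sketch of what the resulting configuration looks like (option names are the Havana-era ones; the user names and secret UUID are assumptions based on the pools and ACLs shown later):

/etc/cinder/cinder.conf:
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  rbd_pool = volumes
  rbd_user = volumes
  rbd_secret_uuid = <libvirt secret UUID>
/etc/glance/glance-api.conf:
  default_store = rbd
  rbd_store_pool = images
  rbd_store_user = images
/etc/nova/nova.conf (with the havana-ephemeral-rbd patches):
  libvirt_images_type = rbd
  libvirt_images_rbd_pool = compute
  rbd_user = compute
  rbd_secret_uuid = <libvirt secret UUID>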
What does it look like?
Select storage options ⇒ assign roles to nodes ⇒ allocate disks: [Fuel UI screenshots]
Things we’ve done
1. Set the right GPT type GUIDs on OSD and journal partitions for udev automount rules
2. ceph-deploy: set up root SSH between Ceph nodes
3. Basic Ceph settings: cephx, pool size, networks
4. Cephx: the ceph auth command line can't be split
5. RADOS Gateway: has to use Inktank's fork of FastCGI; set an infinite revocation interval for UUID auth tokens to work
6. Patch Cinder to convert non-raw images when creating an RBD-backed volume from Glance
7. Patch Nova: clone RBD-backed Glance images into RBD-backed ephemeral volumes, pass the RBD user to qemu-img
8. Ephemeral RBD: disable SSH key injection, set up Nova, libvirt, and QEMU for live migrations
Disk partitioning for Ceph OSD
Flow of disk partitioning information during discovery, configuration, provisioning, and deployment:
[Diagram: on the Fuel master node, the Fuel UI assigns the ceph-osd role and disk allocation, Nailgun serializes volumes into ks_spaces in openstack.json, and Cobbler's pmanager partitions the base OS disks with parted; on the target node, the MCollective agent scans disks, sets OSD and journal partition types (sgdisk, ceph-deploy), and the osd_devices_list Facter fact drives the ceph::osd Puppet class]
GPT partition type GUIDs according to ceph-disk:
JOURNAL_UUID = '45b0969e-9b03-4f30-b4c6-b4b80ceff106'
OSD_UUID = '4fbd7e29-9d25-41b8-afd0-062c0ceff05d'
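For example, tagging a data and a journal partition with these GUIDs so the udev rules auto-activate them would look roughly like this (the device and partition numbers are assumed):

sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdb   # OSD data partition
sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb   # OSD journal partition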
If more than one device is allocated for the OSD journal, journal devices are evenly distributed between OSDs.
Cephx authentication settings
Monitor ACL is the same for all Cephx users: allow r
OSD ACLs vary per OpenStack component:
Glance: allow class-read object_prefix rbd_children, allow rwx pool=images
Cinder: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images
Nova: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images, allow rwx pool=compute
Watch out: Cephx is easily tripped up by unexpected whitespace in ceph auth command line parameters, so we have to keep them all on a single line.
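For example, creating the Glance key as one single-line command (the client name is an assumption; the caps are the ones listed above):

ceph auth get-or-create client.glance mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'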
Types of VM migrations
OpenStack:
Live vs offline: Is the VM stopped during migration?
Block vs shared storage vs volume-backed: Is VM data shared between nodes? Is VM metadata (e.g. the libvirt domain XML) shared?
Libvirt:
Native vs tunneled: Is VM state transferred directly between hypervisors or tunneled by libvirtd?
Direct vs peer-to-peer: Is migration controlled by the libvirt client or by the source libvirtd?
Managed vs unmanaged: Is migration controlled by libvirt or by the hypervisor itself?
Our type: Live, volume-backed*, native, peer-to-peer, managed.
Live VM migrations with Ceph
- Enable native peer-to-peer live migration:
  [Diagram: Nova and libvirtd on the source compute node migrate a VM to libvirtd and Nova on the destination compute node]
  libvirt VIR_MIGRATE_* flags: LIVE, PEER2PEER, UNDEFINE_SOURCE, PERSIST_DEST
- Patch Nova to decouple shared volumes from shared libvirt metadata logic during live migration
- Set the VNC listen address to 0.0.0.0 and block VNC from outside the management network in iptables
- Open ports 49152+ between computes for QEMU migrations
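In nova.conf terms this roughly translates to the following (option names are the Havana-era ones; treat the values as a sketch, not the exact Fuel defaults):

live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST
vncserver_listen = 0.0.0.0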
Things we left undone
1. Non-root user with sudo for ceph-deploy
2. Calculate PG numbers based on the number of OSDs (see the rule of thumb after this list)
3. Ceph public network should go to a second storage network instead of management
4. Dedicated Monitor nodes; list all Monitors in ceph.conf on each Ceph node
5. Multi-backend configuration for Cinder
6. A better way to configure pools for OpenStack services (than CEPH_ARGS in the init script)
7. Make Nova update the VM's VNC listen address to vncserver_listen of the destination compute after migration
8. Replace 'qemu-img convert' with clone_image() in LibvirtDriver.snapshot() in Nova
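For reference, the usual rule of thumb from the Ceph documentation (not something Fuel does yet; the numbers below are just a worked example):

pg_num ≈ (number of OSDs × 100) / replica count, rounded up to the next power of two
e.g. 20 OSDs with pool size 3: 20 × 100 / 3 ≈ 667 → pg_num = 1024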
Diagnostics and troubleshooting
ceph -s
ceph osd tree
cinder create 1
rados df
qemu-img convert -O raw cirros.qcow2 cirros.raw
glance image-create --name cirros-raw --is-public yes \
  --container-format bare --disk-format raw < cirros.raw
nova boot --flavor 1 --image cirros-raw vm0
nova live-migration vm0 node-3
disk partitioning failed during provisioning – check if traces of previous partition tables are left on any drives
'ceph-deploy config pull' failed – check if the node can ssh to the primary controller over the management network
HEALTH_WARN: clock skew detected – check your ntpd settings, make sure your NTP server is reachable from all nodes
ENOSPC when storing small objects in RGW – try setting a smaller rgw object stripe size
Resources
Read the docs:
http://ceph.com/docs/next/rbd/rbd-openstack/
http://docs.mirantis.com/fuel/fuel-4.0/
http://libvirt.org/migration.html
http://docs.openstack.org/admin-guide-cloud/content/ch_introduction-to-openstack-compute.html
Get the code:
- Mirantis OpenStack ISO image and VirtualBox scripts,
- ceph Puppet module for Fuel,
- Josh Durgin's havana-ephemeral-rbd branch for Nova.
Vote on Nova bugs: #1226351, #1261675, #1262450, #1262914.
Sign up for the Mirantis and Inktank webcast on Ceph and OpenStack.