CloudOpen - 08/29/2012
![Page 1: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/1.jpg)
ceph – a unified distributed storage system
sage weil
cloudopen – august 29, 2012
![Page 2: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/2.jpg)
outline
● why you should care
● what is it, what it does
● how it works
  ● architecture
● how you can use it
  ● librados
  ● radosgw
  ● RBD
  ● file system
● who we are, why we do this
![Page 3: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/3.jpg)
why should you care about another storage system?
![Page 4: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/4.jpg)
requirements
● diverse storage needs
  ● object storage
  ● block devices (for VMs) with snapshots, cloning
  ● shared file system with POSIX, coherent caches
  ● structured data... files, block devices, or objects?
● scale
  ● terabytes, petabytes, exabytes
  ● heterogeneous hardware
  ● reliability and fault tolerance
![Page 5: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/5.jpg)
time
● ease of administration
● no manual data migration, load balancing
● painless scaling
  ● expansion and contraction
  ● seamless migration
![Page 6: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/6.jpg)
cost
● linear function of size or performance
● incremental expansion
  ● no fork-lift upgrades
● no vendor lock-in
  ● choice of hardware
  ● choice of software
● open
![Page 7: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/7.jpg)
what is ceph?
![Page 8: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/8.jpg)
unified storage system
● objects
  ● native
  ● RESTful
● block
  ● thin provisioning, snapshots, cloning
● file
  ● strong consistency, snapshots
![Page 9: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/9.jpg)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
(diagram labels for the users of these interfaces: APP, APP, HOST/VM, CLIENT)
![Page 10: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/10.jpg)
open source
● LGPLv2
  ● copyleft
  ● ok to link to proprietary code
● no copyright assignment
  ● no dual licensing
  ● no “enterprise-only” feature set
● active community
● commercial support
![Page 11: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/11.jpg)
distributed storage system
● data center scale
  ● 10s to 10,000s of machines
  ● terabytes to exabytes
● fault tolerant
  ● no single point of failure
  ● commodity hardware
● self-managing, self-healing
![Page 12: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/12.jpg)
ceph object model
● pools
  ● 1s to 100s
  ● independent namespaces or object collections
  ● replication level, placement policy
● objects
  ● bazillions
  ● blob of data (bytes to gigabytes)
  ● attributes (e.g., “version=12”; bytes to kilobytes)
  ● key/value bundle (bytes to gigabytes)
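The object model on this slide can be pictured as a small data structure. The sketch below is only an illustration of the slide's vocabulary, not Ceph's internal representation; the class names, the field names (`xattrs`, `omap`, `placement_rule`) and the example pool names are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StoredObject:
    """One object: a byte blob, small attributes, and a key/value bundle."""
    data: bytes = b""
    xattrs: Dict[str, bytes] = field(default_factory=dict)  # e.g. version=12
    omap: Dict[str, bytes] = field(default_factory=dict)    # key/value bundle


@dataclass
class Pool:
    """A pool: an independent flat namespace plus per-pool policy."""
    name: str
    replicas: int            # replication level
    placement_rule: str      # hypothetical label standing in for a placement policy
    objects: Dict[str, StoredObject] = field(default_factory=dict)


images = Pool(name="images", replicas=3, placement_rule="one-copy-per-rack")
images.objects["photo-001"] = StoredObject(
    data=b"...image bytes...",
    xattrs={"version": b"12"},
)

# The same object name in a different pool refers to a different object.
scratch = Pool(name="scratch", replicas=2, placement_rule="any-host")
scratch.objects["photo-001"] = StoredObject(data=b"unrelated")
```

Keeping the replication level and placement policy on the pool rather than on each object matches the slide's split between "pools" and "objects".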
![Page 13: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/13.jpg)
why start with objects?
● more useful than (disk) blocks
  ● names in a single flat namespace
  ● variable size
  ● simple API with rich semantics
● more scalable than files
  ● no hard-to-distribute hierarchy
  ● update semantics do not span objects
  ● workload is trivially parallel
![Page 14: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/14.jpg)
(diagram: one HUMAN, one COMPUTER, several DISKs)
![Page 15: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/15.jpg)
(diagram: three HUMANs, one COMPUTER, several DISKs)
![Page 16: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/16.jpg)
(diagram: a crowd of HUMANs, one (COMPUTER), many DISKs)
(actually more like this…)
![Page 17: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/17.jpg)
(diagram: HUMANs, many COMPUTERs, many DISKs)
![Page 18: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/18.jpg)
(diagram: five stacks of OSD on FS on DISK; FS: btrfs, xfs, ext4; three monitors, M)
![Page 19: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/19.jpg)
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients
Object Storage Daemons (OSDs):
• At least three in a cluster
• One per disk or RAID group
• Serve stored objects to clients
• Intelligently peer to perform replication tasks
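The "small, odd number" guidance for monitors can be motivated with a little arithmetic. The sketch below assumes the monitors' consensus needs a strict majority of them to be reachable; the function names are made up for this example.

```python
def quorum_size(monitors: int) -> int:
    """Smallest group that is a strict majority of the monitors."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority remains."""
    return monitors - quorum_size(monitors)


for count in (1, 2, 3, 4, 5):
    print(count, quorum_size(count), tolerated_failures(count))
```

Under that assumption, going from 3 to 4 monitors does not raise the number of tolerated failures, which is one way to read the preference for odd counts.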
![Page 20: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/20.jpg)
(diagram: HUMAN and three monitors, M)
![Page 21: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/21.jpg)
data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  ● ceph-osds on hosts in racks in rows in data centers
● three approaches
  ● pick a spot; remember where you put it
  ● pick a spot; write down where you put it
  ● calculate where to put it, where to find it
![Page 22: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/22.jpg)
CRUSH
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Ensures even distribution
• Stable mapping
• Limited data migration
• Rule-based configuration
• specifiable replication
• infrastructure topology aware
• allows weighting
![Page 23: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/23.jpg)
(diagram: objects, drawn as bit patterns, grouped into placement groups and mapped onto the cluster)
hash(object name) % num pg
CRUSH(pg, cluster state, policy)
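The two formulas above describe a two-step calculation: object name to placement group, then placement group to storage daemons. The sketch below follows that shape but is not CRUSH: the second step uses plain rendezvous (highest-random-weight) hashing as a stand-in, with no topology rules or weights, and every function name here is an assumption made for the example.

```python
import hashlib
from typing import List


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, num_pg: int) -> int:
    """Step 1: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pg


def pg_to_osds(pg: int, up_osds: List[str], replicas: int) -> List[str]:
    """Step 2 stand-in: rank the OSDs by a per-(pg, osd) hash, keep the top few."""
    ranked = sorted(up_osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


def locate(object_name: str, num_pg: int, up_osds: List[str], replicas: int) -> List[str]:
    """Any client can compute this; nothing is looked up."""
    return pg_to_osds(object_to_pg(object_name, num_pg), up_osds, replicas)


osds = [f"osd.{i}" for i in range(6)]
before = {pg: pg_to_osds(pg, osds, 3) for pg in range(64)}
survivors = [osd for osd in osds if osd != "osd.5"]
after = {pg: pg_to_osds(pg, survivors, 3) for pg in range(64)}
moved = sum(1 for pg in before if before[pg] != after[pg])
print(f"{moved} of {len(before)} placement groups changed after losing osd.5")
```

The stand-in shares two of the properties listed for CRUSH: the result is repeatable, and removing one daemon only changes the placement groups that were using it.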
![Page 24: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/24.jpg)
![Page 25: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/25.jpg)
RADOS
● monitors publish osd map that describes cluster state
  ● ceph-osd node status (up/down, weight, IP)
  ● CRUSH function specifying desired data distribution
● object storage daemons (OSDs)
  ● safely replicate and store object
  ● migrate data as the cluster changes over time
  ● coordinate based on shared view of reality
● decentralized, distributed approach allows
  ● massive scales (10,000s of servers or more)
  ● the illusion of a single copy with consistent behavior
![Page 26: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/26.jpg)
CLIENT
??
![Page 27: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/27.jpg)
![Page 28: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/28.jpg)
![Page 29: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/29.jpg)
CLIENT
??
![Page 30: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/30.jpg)
![Page 31: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/31.jpg)
(diagram: APP linked with LIBRADOS, speaking the native protocol to the cluster and its three monitors, M)
![Page 32: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/32.jpg)
LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java
• No HTTP overhead
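As an illustration of what direct access looks like from Python, here is a minimal sketch using the `rados` binding that ships with Ceph. It needs a running cluster to do anything useful. The config path, the default client credentials, and the pool and object names passed in are assumptions, and the helper `store_and_fetch` is a name invented for this example.

```python
def store_and_fetch(pool_name: str, object_name: str, payload: bytes) -> bytes:
    """Write an object, tag it with an attribute, and read it back."""
    # Imported here so this module still loads where the Ceph bindings are absent.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumed config path
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool_name)  # I/O context bound to one pool
        try:
            ioctx.write_full(object_name, payload)           # replace object contents
            ioctx.set_xattr(object_name, "version", b"12")   # small attribute
            return ioctx.read(object_name, length=len(payload))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A call such as `store_and_fetch("data", "greeting", b"hello")` would talk to the storage daemons directly, with no HTTP gateway in the path.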
![Page 33: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/33.jpg)
![Page 34: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/34.jpg)
(diagram: APP speaks REST to RADOSGW; RADOSGW uses LIBRADOS to speak the native protocol to the cluster and its three monitors, M)
![Page 35: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/35.jpg)
RADOS Gateway:
• REST-based interface to RADOS
• Supports buckets, accounting
• Compatible with S3 and Swift applications
![Page 36: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/36.jpg)
![Page 37: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/37.jpg)
(diagram: many COMPUTERs, each with a DISK)
![Page 38: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/38.jpg)
(diagram: many COMPUTERs, each with a DISK, and three VMs)
![Page 39: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/39.jpg)
(diagram: VM inside a VIRTUALIZATION CONTAINER that uses LIBRBD on top of LIBRADOS to reach the cluster and its three monitors, M)
![Page 40: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/40.jpg)
(diagram: two CONTAINERs, each with LIBRBD on top of LIBRADOS, one VM, and the cluster with its three monitors, M)
![Page 41: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/41.jpg)
(diagram: HOST with KRBD (KERNEL MODULE) reaching the cluster and its three monitors, M; LIBRADOS label)
![Page 42: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/42.jpg)
RADOS Block Device:
• Storage of virtual disks in RADOS
• Decouples VMs and containers
  • Live migration!
• Images are striped across the cluster
• Snapshots!
• Support in
  • QEMU/KVM
  • OpenStack, CloudStack
  • Mainline Linux kernel
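"Images are striped across the cluster" can be illustrated with the arithmetic that turns a byte offset in a virtual disk into an object. The sketch below assumes the image is cut into fixed-size 4 MiB objects; the object size and the function names are assumptions for this example, and it does not model how objects are named or placed.

```python
from typing import List, Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def locate_block(image_offset: int) -> Tuple[int, int]:
    """Map a byte offset in the image to (object index, offset inside that object)."""
    return divmod(image_offset, OBJECT_SIZE)


def objects_touched(image_offset: int, length: int) -> List[int]:
    """Indices of every object that a read or write of `length` bytes touches."""
    if length <= 0:
        return []
    first = image_offset // OBJECT_SIZE
    last = (image_offset + length - 1) // OBJECT_SIZE
    return list(range(first, last + 1))


print(locate_block(OBJECT_SIZE + 5))
print(objects_touched(OBJECT_SIZE - 1, 2))
```

Because each object is placed independently, a large sequential read or write spreads over many storage daemons instead of a single disk.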
![Page 43: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/43.jpg)
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
![Page 44: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/44.jpg)
144 + 0 + 0 + 0 + 0 = 144
instant copy
![Page 45: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/45.jpg)
(diagram: CLIENT issues four writes to a copy)
144 + 4 = 148
![Page 46: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/46.jpg)
(diagram: CLIENT issues three reads from the copy)
144 + 4 = 148
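The arithmetic on these slides (144, then 148) matches what copy-on-write layering would produce. The sketch below is a toy model under that reading, not RBD's implementation: a clone stores only the objects written to it and falls back to its parent for everything else. The `Image` class and the whole-object write granularity are assumptions for this example, and the parent is treated as read-only once cloned.

```python
from typing import Dict, Optional


class Image:
    """Toy block image: a sparse map from object index to object contents."""

    def __init__(self, parent: Optional["Image"] = None) -> None:
        self.parent = parent
        self.objects: Dict[int, bytes] = {}

    def clone(self) -> "Image":
        """Instant copy: the clone starts with zero objects of its own."""
        return Image(parent=self)

    def write(self, index: int, data: bytes) -> None:
        """Writes land in this image only; the parent is never modified."""
        self.objects[index] = data

    def read(self, index: int) -> bytes:
        """Reads use this image's object if present, else fall through to the parent."""
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)
        return b""


base = Image()
for i in range(144):
    base.write(i, b"base")

clones = [base.clone() for _ in range(4)]
stored_before = len(base.objects) + sum(len(c.objects) for c in clones)

vm_disk = clones[0]
for i in range(4):
    vm_disk.write(i, b"new")
stored_after = len(base.objects) + sum(len(c.objects) for c in clones)

print(stored_before, stored_after)
```

In this model, spinning up another VM costs no extra storage until that VM starts writing.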
![Page 47: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/47.jpg)
![Page 48: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/48.jpg)
(diagram: CLIENT with separate “data” and “metadata” paths into the cluster; three monitors, M)
![Page 49: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/49.jpg)
(diagram: cluster with three monitors, M)
![Page 50: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/50.jpg)
Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
![Page 51: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/51.jpg)
one tree
three metadata servers
??
![Page 52: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/52.jpg)
![Page 53: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/53.jpg)
![Page 54: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/54.jpg)
![Page 55: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/55.jpg)
![Page 56: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/56.jpg)
DYNAMIC SUBTREE PARTITIONING
![Page 57: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/57.jpg)
recursive accounting
● ceph-mds tracks recursive directory stats
  ● file sizes
  ● file and directory counts
  ● modification time
● virtual xattrs present full stats
● efficient

    $ ls -alSh | head
    total 0
    drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
    drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
    drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
    drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
    drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
    drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
    drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
    drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
    drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
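The idea behind recursive accounting is that each directory carries totals for its whole subtree, so a listing like the one above needs no tree walk. The sketch below is a toy model of that bookkeeping, not how ceph-mds implements it: it updates every ancestor eagerly on each change, and the `Dir` class and its field names (`rbytes`, `rfiles`, `rsubdirs`) are assumptions chosen for this example.

```python
from typing import Dict, Optional


class Dir:
    """Directory that keeps running totals for everything beneath it."""

    def __init__(self, name: str, parent: Optional["Dir"] = None) -> None:
        self.name = name
        self.parent = parent
        self.files: Dict[str, int] = {}  # file name -> size in bytes
        self.rbytes = 0                  # bytes in this subtree
        self.rfiles = 0                  # files in this subtree
        self.rsubdirs = 0                # directories below this one
        if parent is not None:
            parent._propagate(subdirs=1)

    def _propagate(self, nbytes: int = 0, files: int = 0, subdirs: int = 0) -> None:
        """Apply a delta to this directory and every ancestor up to the root."""
        node: Optional["Dir"] = self
        while node is not None:
            node.rbytes += nbytes
            node.rfiles += files
            node.rsubdirs += subdirs
            node = node.parent

    def write_file(self, name: str, size: int) -> None:
        """Create or resize a file and push the size change up the tree."""
        previous = self.files.get(name)
        self.files[name] = size
        if previous is None:
            self._propagate(nbytes=size, files=1)
        else:
            self._propagate(nbytes=size - previous)


root = Dir("/")
home = Dir("home", root)
alice = Dir("alice", home)
alice.write_file("a.bin", 100)
home.write_file("notes", 50)
print(root.rbytes, root.rfiles, root.rsubdirs)
```

Reading a directory's size is then a single field access, which is the kind of cheap lookup the "efficient" bullet points at.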
![Page 58: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/58.jpg)
snapshots
● volume or subvolume snapshots unusable at petabyte scale
● snapshot arbitrary subdirectories
● simple interface
  ● hidden '.snap' directory
  ● no special tools

    $ mkdir foo/.snap/one          # create snapshot
    $ ls foo/.snap
    one
    $ ls foo/bar/.snap
    _one_1099511627776             # parent's snap name is mangled
    $ rm foo/myfile
    $ ls -F foo
    bar/
    $ ls -F foo/.snap/one
    myfile bar/
    $ rmdir foo/.snap/one          # remove snapshot
![Page 59: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/59.jpg)
multiple protocols, implementations
● Linux kernel client
  ● mount -t ceph 1.2.3.4:/ /mnt
  ● export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  ● your app
  ● Samba (CIFS)
  ● Ganesha (NFS)
  ● Hadoop (map/reduce)
(diagram: the kernel client; ceph-fuse; and libcephfs linked into your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop)
![Page 60: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/60.jpg)
(architecture overview again, now with status labels: RADOS, LIBRADOS, RBD, and RADOSGW are AWESOME; CEPH FS is NEARLY AWESOME)
![Page 61: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/61.jpg)
why we do this
● limited options for scalable open source storage
● proprietary solutions
  ● expensive
  ● don't scale (well or out)
  ● marry hardware and software
● industry needs to change
![Page 62: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/62.jpg)
who we are
● Ceph created at UC Santa Cruz (2007)
● supported by DreamHost (2008-2011)
● Inktank (2012)
  ● Los Angeles, Sunnyvale, San Francisco, remote
● growing user and developer community
  ● Linux distros, users, cloud stacks, SIs, OEMs
http://ceph.com/
![Page 63: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/63.jpg)
thanks
BoF tonight @ 5:15
sage weil
@liewegas
http://github.com/ceph
http://ceph.com/
![Page 65: CloudOpen - 08/29/2012](https://reader033.fdocuments.us/reader033/viewer/2022051609/54809a94b4af9fbe158b5ede/html5/thumbnails/65.jpg)
why we like btrfs
● pervasive checksumming
● snapshots, copy-on-write
● efficient metadata (xattrs)
● inline data for small files
● transparent compression
● integrated volume management
  ● software RAID, mirroring, error recovery
  ● SSD-aware
● online fsck
● active development community