Ceph: a decade in the making and still going strong
Sage Weil
Today the part of Sage Weil will be played by...
Research beginnings
UCSC research grant
Petascale object storage
DOE: LANL, LLNL, Sandia
Scalability, reliability, performance
HPC file system workloads
Scalable metadata management
First line of Ceph code
Summer internship at LLNL
High security national lab environment
Could write anything, as long as it was OSS
The rest of Ceph
RADOS distributed object storage cluster (2005)
EBOFS local object storage (2004/2006)
CRUSH hashing for the real world (2005)
Paxos monitors cluster consensus (2006)
Emphasis on consistent, reliable storage
Scale by pushing intelligence to the edges
A different but compelling architecture
Industry black hole
Many large storage vendors
Proprietary solutions that don't scale well
Few open source alternatives (2006)
Very limited scale, or
Limited community and architecture (Lustre)
No enterprise feature sets (snapshots, quotas)
PhD grads all built interesting systems...
...and then went to work for NetApp, DDN, EMC, Veritas.
They want you, not your project
A different path?
Change the storage world with open source
Do what Linux did to Solaris, Irix, Ultrix, etc.
License
LGPL: share changes, okay to link to proprietary code
Avoid unfriendly practices
Dual licensing
Copyright assignment
Platform
Remember sourceforge.net?
Incubation
DreamHost!
Move back to LA, continue hacking
Hired a few developers
Pure development
No deliverables
Ambitious feature set
Native Linux kernel client (2007-)
Per-directory snapshots (2008)
Recursive accounting (2008)
Object classes (2009)
librados (2009)
radosgw (2009)
strong authentication (2009)
RBD: rados block device (2010)
The kernel client
ceph-fuse was limited, not very fast
Build native Linux kernel implementation
Began attending Linux file system developer events (LSF)
Early words of encouragement from ex-Lustre dev
Engage Linux fs developer community as peer
Initial merge attempts rejected by Linus
Not sufficient evidence of user demand
A few fans and would-be users chimed in...
Eventually merged for v2.6.34 (early 2010)
Part of a larger ecosystem
Ceph need not solve all problems as monolithic stack
Replaced ebofs object file system with btrfs
Same design goals; avoid reinventing the wheel
Robust, supported, well-optimized
Kernel-level cache management
Copy-on-write, checksumming, other goodness
Contributed some early functionality
Cloning files
Async snapshots
Budding community
#ceph on irc.oftc.net, [email protected]
Many interested users
A few developers
Many fans
Too unstable for any real deployments
Still mostly focused on right architecture and technical solutions
Road to product
DreamHost decides to build an S3-compatible object storage service with Ceph
Stability
Focus on core RADOS, RBD, radosgw
Paying back some technical debt
Build testing automation
Code review!
Expand engineering team
The reality
Growing incoming commercial interest
Early attempts from organizations large and small
Difficult to engage with a web hosting company
No means to support commercial deployments
Project needed a company to back it
Fund the engineering effort
Build and test a product
Support users
Orchestrated a spin out of DreamHost in 2012
Inktank
Do it right
How do we build a strong open source company?
How do we build a strong open source community?
Models?
Red Hat, SUSE, Cloudera, MySQL, Canonical, ...
Initial funding from DreamHost, Mark Shuttleworth
Goals
A stable Ceph release for production deployment
DreamObjects
Lay foundation for widespread adoption
Platform support (Ubuntu, Red Hat, SUSE)
Documentation
Build and test infrastructure
Build a sales and support organization
Expand engineering organization
Branding
Early decision to engage professional agency
Terms like
Brand core
Design system
Company vs Project
Inktank != Ceph
Establish a healthy relationship with the community
Aspirational messaging: The Future of Storage
Slick graphics
Broken PowerPoint template
Traction
Too many production deployments to count
We don't know about most of them!
Too many customers (for me) to count
Growing partner list
Lots of buzz
OpenStack
Quality
Increased adoption means increased demands on robust testing
Across multiple platforms
Include platforms we don't use
Upgrades
Rolling upgrades
Inter-version compatibility
Developer community
Significant external contributors
First-class feature contributions from contributors
Non-Inktank participants in daily stand-ups
External access to build/test lab infrastructure
Common toolset
GitHub
Email (kernel.org)
IRC (oftc.net)
Linux distros
CDS: Ceph Developer Summit
Community process for building project roadmap
100% online
Google Hangouts
Wikis
Etherpad
First was in Spring 2013, fifth is in two weeks
Great feedback, growing participation
Indoctrinating our own developers to an open development model
And then...
s/Red Hat of Storage/Storage of Red Hat/
Calamari
Inktank strategy was to package Ceph for the Enterprise
Inktank Ceph Enterprise (ICE)
Ceph: a hardened, tested, validated version
Calamari: management layer and GUI (proprietary!)
Enterprise integrations: SNMP, Hyper-V, VMware
Support SLAs
Red Hat model is pure open source
Open sourced Calamari
The Present
Tiering
Client side caches are great, but only buy so much.
Can we separate hot and cold data onto different storage devices?
Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool
Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding, NYI)
Very Cold Pools (efficient erasure coding, compression, osd spin down to save power) OR tape/public cloud
How do you identify what is hot and cold?
Common in enterprise solutions; not found in open source scale-out systems
Cache pools new in Firefly, better in Giant
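The promote/demote idea above can be sketched as a toy policy. Ceph's cache pools actually decide what is hot using HitSet sampling, not an exact per-object counter; the class and thresholds below are purely illustrative:

```python
from collections import defaultdict

class TieredStore:
    """Toy cache-tiering model: promote an object from the cold (base)
    tier to the hot (cache) tier after it is read N times.
    Illustrative only -- not Ceph's actual cache-pool logic."""

    def __init__(self, promote_after=2):
        self.cold = {}                    # slow, cheap tier (e.g., erasure coded)
        self.hot = {}                     # fast tier (e.g., flash)
        self.hits = defaultdict(int)      # per-object read counter
        self.promote_after = promote_after

    def write(self, name, data):
        # In this toy model, new writes land in the cold tier.
        self.cold[name] = data

    def read(self, name):
        if name in self.hot:              # cache hit: serve from the fast tier
            return self.hot[name]
        self.hits[name] += 1
        data = self.cold[name]
        if self.hits[name] >= self.promote_after:
            self.hot[name] = data         # object is hot: promote it
        return data

store = TieredStore(promote_after=2)
store.write("obj1", b"payload")
store.read("obj1")                        # first read: still cold
store.read("obj1")                        # second read: promoted
assert "obj1" in store.hot
```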
Erasure coding
Replication for redundancy is flexible and fast
For larger clusters, it can be expensive
We can trade recovery performance for storage
Erasure coded data is hard to modify, but ideal for cold or read-only objects
Cold storage tiering
Will be used directly by radosgw
                  Storage overhead   Repair traffic   MTTDL (days)
3x replication    3x                 1x               2.3 E10
RS (10, 4)        1.4x               10x              3.3 E13
LRC (10, 6, 5)    1.6x               5x               1.2 E15
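The overhead column follows directly from the code parameters: a (k, m) erasure code stores k data chunks plus m coding chunks, so RS(10, 4) costs (10+4)/10 = 1.4x versus 3x for triple replication. A minimal sketch of that arithmetic:

```python
def ec_overhead(k, m):
    """Storage overhead of a (k, m) erasure code: k data chunks
    plus m coding chunks, relative to the raw data size."""
    return (k + m) / k

def replication_overhead(copies):
    """Replication stores one full copy per replica."""
    return float(copies)

assert replication_overhead(3) == 3.0   # 3x replication
assert ec_overhead(10, 4) == 1.4        # RS (10, 4)
assert ec_overhead(10, 6) == 1.6        # LRC with 10 data, 6 coding chunks
```

The repair-traffic column reflects the flip side of the trade: plain RS must read k surviving chunks to rebuild one lost chunk (10x), while LRC's local parity groups cut that roughly in half (5x) at the cost of slightly higher overhead.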
Erasure coding (cont'd)
In Firefly
LRC in Giant
Intel ISA-L (optimized library) in Giant, maybe backported to Firefly
Talk of ARM optimized (NEON) jerasure
Async Replication in RADOS
Clinic project with Harvey Mudd
Group of students working on real world project
Reason the bounds on clock drift so we can achieve point-in-time consistency across a distributed set of nodes
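A hedged sketch of the clock-drift reasoning: if every node's clock is within some bound of true time, two timestamps from different nodes can only be ordered with certainty when they differ by more than twice that bound. The function names below are illustrative, not from the student project:

```python
def safe_ordering_gap(max_drift):
    """Two timestamps from different nodes, each with clock error
    bounded by max_drift, can only be ordered reliably if they
    differ by more than 2 * max_drift."""
    return 2 * max_drift

def definitely_before(t1, t2, max_drift):
    """True only when the event stamped t1 certainly happened before
    the event stamped t2, given the per-node clock error bound."""
    return t2 - t1 > safe_ordering_gap(max_drift)

# With clocks accurate to +/- 50 ms, stamps 100 ms apart are ambiguous,
# while stamps 150 ms apart are safely ordered.
assert not definitely_before(0.000, 0.100, max_drift=0.050)
assert definitely_before(0.000, 0.150, max_drift=0.050)
```

Tightening the drift bound shrinks this ambiguity window, which is what makes point-in-time consistency across a distributed set of nodes practical.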
CephFS
Dogfooding for internal QA infrastructure
Learning lots
Many rough edges, but working quite well!
We want to hear from you!
The Future
CephFS
This is where it all started; let's get there
Today
QA coverage and bug squashing continues
NFS and CIFS now largely complete and robust
Multi-MDS stability continues to improve
Need
QA investment
Snapshot work
Amazing community effort
The larger ecosystem
Storage backends
Backends are pluggable
Recent work to use rocksdb everywhere leveldb can be used (mon/osd); can easily plug in other key/value store libraries
Other possibilities include LMDB or NVMKV (from Fusion-io)
Prototype kinetic backend
Alternative OSD backends
KeyValueStore: put all data in a k/v db (Haomai @ UnitedStack)
KeyFileStore initial plans (2nd gen?)
Some partners looking at backends tuned to their hardware
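To illustrate what "pluggable" means here, a minimal sketch of a backend interface with a k/v-backed store; the class and method names are simplified stand-ins, not Ceph's actual ObjectStore API:

```python
# Toy model of a pluggable object-store backend, in the spirit of OSD
# backends like FileStore or KeyValueStore. Names are illustrative.

class Backend:
    """Interface the (toy) OSD codes against."""
    def put(self, name, data):
        raise NotImplementedError
    def get(self, name):
        raise NotImplementedError

class MemKVStore(Backend):
    """Everything in a key/value map, like a k/v-db-backed OSD."""
    def __init__(self):
        self.kv = {}
    def put(self, name, data):
        self.kv[name] = data
    def get(self, name):
        return self.kv[name]

def osd_write_read(backend):
    # The OSD only sees the Backend interface, so stores swap freely:
    # a rocksdb, LMDB, or hardware-tuned backend could slot in here.
    backend.put("obj", b"hello")
    return backend.get("obj")

assert osd_write_read(MemKVStore()) == b"hello"
```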
Governance
How do we strengthen the project community?
Acknowledge Sage's role as maintainer / BDFL
Recognize project leads
RBD, RGW, RADOS, CephFS, Calamari, etc.
Formalize processes around CDS, community roadmap
Formal foundation?
Community build and test lab infrastructure
Build and test for broad range of OSs, distros, hardware
Technical roadmap
How do we reach new use-cases and users?
How do we better satisfy existing users?
How do we ensure Ceph can succeed in enough markets for business investment to thrive?
Enough breadth to expand and grow the community
Enough focus to do well
Performance
Lots of work with partners to improve performance
High-end flash back ends: optimize hot paths to limit CPU usage, drive up IOPS
Improve threading, fine-grained locks
Low-power processors. Run well on small ARM devices (including those new-fangled ethernet drives)
Ethernet Drives
Multiple vendors are building 'ethernet drives'
Normal hard drives w/ small ARM host on board
Could run OSD natively on the drive, completely remove the host from the deployment
Many different implementations; some vendors need help w/ open architecture and ecosystem concepts
Current devices are hard disks; no reason they couldn't also be flash-based, or hybrid
This is exactly what we were thinking when Ceph was originally designed!
Big data
Why is big data built on such a weak storage model?
Move computation to the data
Evangelize RADOS classes
librados case studies and proof points
Build a general purpose compute and storage platform
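The compute-to-data idea behind RADOS classes, as a toy model: ship a small function to the node holding the object and return only the result, instead of pulling the whole object over the network. FakeOSD and exec_method are made up for illustration; real object classes are C++ plugins running on the OSD:

```python
class FakeOSD:
    """Stand-in for a storage daemon holding objects locally."""
    def __init__(self):
        self.objects = {}

    def exec_method(self, name, fn):
        # Run fn next to the data; only its (small) result crosses
        # the wire, not the object itself.
        return fn(self.objects[name])

osd = FakeOSD()
osd.objects["log"] = b"a\nb\nc\n"

# Client-side: ship the computation, receive an integer, not the object.
line_count = osd.exec_method("log", lambda data: data.count(b"\n"))
assert line_count == 3
```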
The enterprise
How do we pay for all our toys?
Support legacy and transitional interfaces
iSCSI, NFS, pNFS, CIFS
VMware, Hyper-V
Identify the beachhead use-cases
Only takes one use-case to get in the door
Single platform shared storage resource
Bottom-up: earn respect of engineers and admins
Top-down: strong brand and compelling product
Why we can beat the old guard
It is hard to compete with free and open source software
Unbeatable value proposition
Ultimately a more efficient development model
It is hard to manufacture community
Strong foundational architecture
Native protocols, Linux kernel support
Unencumbered by legacy protocols like NFS
Move beyond traditional client/server model
Ongoing paradigm shift
Software defined infrastructure, data center
Thanks!