Building Tomorrow's Ceph
Sage Weil
Research beginnings
UCSC research grant
Petascale object storage
DOE: LANL, LLNL, Sandia
Scalability
Reliability
Performance
Raw IO bandwidth, metadata ops/sec
HPC file system workloads
Thousands of clients writing to the same file or directory
Distributed metadata management
Innovative design
Subtree-based partitioning for locality, efficiency
Dynamically adapt to current workload
Embedded inodes
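A toy illustration of the dynamic subtree partitioning idea (names, shapes, and thresholds are hypothetical; this is not the actual MDS code): each directory carries a load counter, and when one metadata server runs much hotter than another, authority over its busiest subtree is exported to the idle peer, keeping related metadata together.

    class Dir:
        # Directory with an access-load counter (illustrative only).
        def __init__(self, name, children=()):
            self.name, self.children, self.load = name, list(children), 0.0

        def subtree_load(self):
            # Load of this directory plus everything beneath it.
            return self.load + sum(c.subtree_load() for c in self.children)

    def rebalance(mds_roots):
        # mds_roots: {mds_rank: [subtree roots it is authoritative for]}.
        loads = {m: sum(r.subtree_load() for r in roots)
                 for m, roots in mds_roots.items()}
        busy, idle = max(loads, key=loads.get), min(loads, key=loads.get)
        if busy != idle and loads[busy] > 2 * loads[idle] and mds_roots[busy]:
            # Export the hottest subtree from the busy MDS to the idle one.
            victim = max(mds_roots[busy], key=lambda r: r.subtree_load())
            mds_roots[busy].remove(victim)
            mds_roots[idle].append(victim)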
Prototype simulator in Java (2004)
First line of Ceph code
Summer internship at LLNL
High security national lab environment
Could write anything, as long as it was OSS
The rest of Ceph
RADOS distributed object storage cluster (2005)
EBOFS local object storage (2004/2006)
CRUSH hashing for the real world (2005)
Paxos monitors cluster consensus (2006)
Emphasis on consistent, reliable storage
Scale by pushing intelligence to the edges
A different but compelling architecture
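CRUSH is what makes "intelligence at the edges" concrete: any client can compute an object's placement from the cluster map alone, with no central lookup table. A toy sketch in the same spirit, using weighted rendezvous hashing (illustrative only, not the real CRUSH algorithm):

    import hashlib

    def draw(oid, replica, device):
        # Deterministic pseudo-random value in [0, 1) for this (object, replica, device).
        h = hashlib.md5(f"{oid}:{replica}:{device}".encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64

    def place(oid, devices, weights, replicas=3):
        # Pick `replicas` distinct devices; the weight exponent skews the draw
        # so larger devices win proportionally more often.
        chosen = []
        for r in range(replicas):
            ranked = sorted((d for d in devices if d not in chosen),
                            key=lambda d: draw(oid, r, d) ** (1.0 / weights[d]),
                            reverse=True)
            chosen.append(ranked[0])
        return chosen

    # Example: place("obj.42", ["osd0", "osd1", "osd2", "osd3"],
    #                {"osd0": 1.0, "osd1": 1.0, "osd2": 2.0, "osd3": 1.0})

Because placement is a pure function of the map, adding a device moves only a proportional share of the data, and no server has to be consulted on the data path.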
Industry black hole
Many large storage vendors
Proprietary solutions that don't scale well
Few open source alternatives (2006)
Very limited scale, or
Limited community and architecture (Lustre)
No enterprise feature sets (snapshots, quotas)
PhD grads all built interesting systems...
...and then went to work for NetApp, DDN, EMC, Veritas.
They want you, not your project
A different path
Change the world with open source
Do what Linux did to Solaris, Irix, Ultrix, etc.
What could go wrong?
License
GPL, BSD...
LGPL: share changes, okay to link to proprietary code
Avoid unsavory practices
Dual licensing
Copyright assignment
Incubation
DreamHost!
Move back to LA, continue hacking
Hired a few developers
Pure development
No deliverables
Ambitious feature set
Native Linux kernel client (2007-)
Per-directory snapshots (2008)
Recursive accounting (2008)
Object classes (2009)
librados (2009) (see example below)
radosgw (2009)
Strong authentication (2009)
RBD: rados block device (2010)
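A minimal round trip through librados using its Python binding (the conffile path and pool name are assumptions for illustration):

    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")          # any existing pool
        ioctx.write_full("hello-object", b"hello from librados")
        print(ioctx.read("hello-object"))           # b'hello from librados'
        ioctx.close()
    finally:
        cluster.shutdown()

Everything else in the list above (radosgw, RBD) is layered on this same object API.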
The kernel client
ceph-fuse was limited, not very fast
Build native Linux kernel implementation
Began attending Linux file system developer events (LSF)
Early words of encouragement from ex-Lustre devs
Engage Linux fs developer community as peer
Initial merge attempts rejected by Linus
Not sufficient evidence of user demand
A few fans and would-be users chimed in...
Eventually merged for v2.6.34 (early 2010)
Part of a larger ecosystem
Ceph need not solve all problems as monolithic stack
Replaced ebofs object file system with btrfs
Same design goals
Avoid reinventing the wheel
Robust, well-supported, well optimized
Kernel-level cache management
Copy-on-write, checksumming, other goodness
Contributed some early functionality
Cloning files
Async snapshots
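The file-cloning piece is exposed through a btrfs ioctl; a minimal sketch (filenames are placeholders; the constant is the standard value of the original BTRFS_IOC_CLONE request, later generalized as FICLONE):

    import fcntl

    FICLONE = 0x40049409   # _IOW(0x94, 9, int): originally BTRFS_IOC_CLONE

    # Make dst share src's extents copy-on-write (btrfs only in this era).
    with open("src.img", "rb") as src, open("dst.img", "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())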
Budding community
#ceph on irc.oftc.net, ceph-devel@vger.kernel.org
Many interested users
A few developers
Many fans
Too unstable for any real deployments
Still mostly focused on the right architecture and technical solutions
Road to product
DreamHost decides to build an S3-compatible object storage service with Ceph
Stability
Focus on core RADOS, RBD, radosgw
Paying back some technical debt
Build testing automation
Code review!
Expand engineering team
The reality
Growing incoming commercial interest
Early attempts from organizations large and small
Difficult to engage with a web hosting company
No means to support commercial deployments
Project needed a company to back it
Fund the engineering effort
Build and test a product
Support users
Bryan built a framework to spin out of DreamHost
Launch
Do it right
How do we build a strong open source company?
How do we build a strong open source community?
Models?
Red Hat, Cloudera, MySQL, Canonical, ...
Initial funding from DreamHost, Mark Shuttleworth
Goals
A stable Ceph release for production deployment
DreamObjects
Lay foundation for widespread adoption
Platform support (Ubuntu, Red Hat, SuSE)
Documentation
Build and test infrastructure
Build a sales and support organization
Expand engineering organization
Branding
Early decision to engage a professional agency
MetaDesign
Terms like:
Brand core
Design system
Project vs. company
Shared / Separate / Shared core
Inktank != Ceph
Aspirational messaging: The Future of Storage
Slick graphics
Broken PowerPoint template
Today: adoption
Traction
Too many production deployments to count
We don't know about most of them!
Too many customers (for me) to count
Growing partner list
Lots of inbound
Lots of press and buzz
Quality
Increased adoption means increased demands on robust testing
Across multiple platforms
Include platforms we don't like
Upgrades
Rolling upgrades
Inter-version compatibility
Expanding user community + less noise about bugs = a good sign
Developer community
Significant external contributors
First-class feature contributions from external developers
Non-Inktank participants in daily Inktank stand-ups
External access to build/test lab infrastructure
Common toolset
GitHub
Email (kernel.org)
IRC (oftc.net)
Linux distros
CDS: Ceph Developer Summit
Community process for building project roadmap
100% online
Google Hangouts
Wikis
Etherpad
First was this Spring, second is next week
Great feedback, growing participation
Indoctrinating our own developers to an open development model
The Future
Governance
How do we strengthen the project community?
2014 is the year
Might formally acknowledge my role as BDL
Recognized project leads (RBD, RGW, RADOS, CephFS)
Formalize processes around CDS, community roadmap
External foundation?
Technical roadmap
How do we reach new use-cases and users?
How do we better satisfy existing users?
How do we ensure Ceph can succeed in enough markets for Inktank to thrive?
Enough breadth to expand and grow the community
Enough focus to do well
Tiering
Client-side caches are great, but only buy so much.
Can we separate hot and cold data onto different storage devices?
Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool
Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding)
How do you identify what is hot and cold? (see the sketch below)
Common in enterprise solutions; not found in open source scale-out systems
Key topic at CDS next week
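One plausible answer, sketched below (illustrative only, not Ceph's eventual mechanism): keep an exponentially decayed access counter per object and promote or demote around thresholds. The half-life and thresholds here are hypothetical.

    import time
    from collections import defaultdict

    HALF_LIFE = 3600.0              # seconds for heat to decay by half (hypothetical)
    PROMOTE_AT, DEMOTE_AT = 4.0, 0.5

    class HeatMap:
        def __init__(self):
            self.heat = defaultdict(float)    # object id -> decayed access count
            self.stamp = defaultdict(float)   # object id -> last access time

        def record_access(self, oid, now=None):
            now = now if now is not None else time.time()
            elapsed = now - self.stamp[oid]
            # Decay the old count, then add this access.
            self.heat[oid] = self.heat[oid] * 0.5 ** (elapsed / HALF_LIFE) + 1.0
            self.stamp[oid] = now

        def placement(self, oid):
            if self.heat[oid] >= PROMOTE_AT:
                return "cache pool"    # fast tier, e.g. SSD/FusionIO
            if self.heat[oid] <= DEMOTE_AT:
                return "cold pool"     # slow tier, e.g. erasure coded
            return "base pool"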
Erasure coding
Replication for redundancy is flexible and fast
For larger clusters, it can be expensive
Erasure-coded data is hard to modify, but ideal for cold or read-only objects
Cold storage tiering
Will be used directly by radosgw
                 Storage overhead   Repair traffic   MTTDL (days)
3x replication   3x                 1x               2.3 E10
RS (10, 4)       1.4x               10x              3.3 E13
LRC (10, 6, 5)   1.6x               5x               1.2 E15
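The arithmetic behind the table: a (k, m) code stores k data chunks plus m coding chunks, so storage overhead is (k + m) / k; repairing one lost chunk reads k surviving chunks for plain Reed-Solomon, or only the local group for LRC, versus 1x for replication.

    # Reproducing the table's storage column.
    def overhead(k, m):
        return (k + m) / k          # total chunks per unit of data

    print(overhead(10, 4))   # RS(10,4):    1.4x storage, repairs read 10 chunks
    print(overhead(10, 6))   # LRC(10,6,5): 1.6x storage, repairs read only 5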
Multi-datacenter, geo-replication
Ceph was originally designed for single-DC clusters
Synchronous replication
Strong consistency
Growing demand
Enterprise: disaster recovery
ISPs: replicating data across sites for locality
Two strategies:
Use-case specific: radosgw, RBD
Low-level capability in RADOS
RGW: Multi-site and async replication
Multi-site, multi-cluster
Regions: east coast, west coast, etc.
Zones: radosgw sub-cluster(s) within a region
Can federate across same or multiple Ceph clusters
Sync user and bucket metadata across regions
Global bucket/user namespace, like S3
Synchronize objects across zones
Within the same region
Across regions
Admin control over which zones are master/slave
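An illustrative shape for such a federation (a hypothetical layout, not the actual radosgw configuration schema): user/bucket metadata syncs globally from the master region, while object data syncs between zones according to the master/slave assignments.

    # Hypothetical two-region federation layout (illustration only).
    federation = {
        "master_region": "us-east",   # user/bucket metadata syncs out from here
        "regions": {
            "us-east": {"master_zone": "us-east-1",
                        "zones": ["us-east-1", "us-east-2"]},  # intra-region object sync
            "us-west": {"master_zone": "us-west-1",
                        "zones": ["us-west-1"]},
        },
    }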
RBD: simple DR via snapshots
Simple backup capability
Based on block device snapshots
Efficiently mirror changes between consecutive snapshots across clusters
Now supported/orchestrated by OpenStack
Good for coarse synchronization (e.g., hours)
Not real-time
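A minimal sketch of that mirroring using the rbd Python binding's diff_iterate(), which visits only the extents that changed since a given snapshot (conffile paths, pool, and image names are assumptions; the destination image is assumed to already hold from_snap):

    import rados, rbd

    def mirror_delta(pool, image, from_snap, to_snap):
        src = rados.Rados(conffile="/etc/ceph/primary.conf")
        dst = rados.Rados(conffile="/etc/ceph/backup.conf")
        src.connect(); dst.connect()
        try:
            with rbd.Image(src.open_ioctx(pool), image, snapshot=to_snap) as s, \
                 rbd.Image(dst.open_ioctx(pool), image) as d:
                def copy(offset, length, exists):
                    if exists:
                        d.write(s.read(offset, length), offset)
                    else:
                        d.discard(offset, length)   # extent became a hole
                # Visit only extents that changed between from_snap and to_snap.
                s.diff_iterate(0, s.size(), from_snap, copy)
        finally:
            src.shutdown(); dst.shutdown()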
Async replication in RADOS
One implementation to capture multiple use-cases
RBD, CephFS, RGW, RADOS
A harder problem
Scalable: 1000s of OSDs
Point-in-time consistency
Three challenges:
Infer a partial ordering of events in the cluster
Maintain a stable timeline to stream from
Either checkpoints or an event stream
Coordinated roll-forward at the destination
Do not apply any update until we know we have everything that happened before it
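A toy version of that third challenge (illustrative only; assumes each shard delivers its own updates in timestamp order): buffer incoming updates, track each shard's high-water timestamp, and release an update only once every shard has progressed past it, so nothing that happened before it can still be in flight.

    import heapq

    class RollForward:
        def __init__(self, shards):
            self.progress = {s: 0 for s in shards}  # highest timestamp seen per shard
            self.pending = []                       # min-heap of (ts, seq, update)
            self.seq = 0                            # tie-breaker for equal timestamps

        def receive(self, shard, ts, update):
            self.progress[shard] = max(self.progress[shard], ts)
            heapq.heappush(self.pending, (ts, self.seq, update))
            self.seq += 1
            # Everything at or below the slowest shard's timestamp is stable:
            # no earlier event can still arrive.
            stable = min(self.progress.values())
            ready = []
            while self.pending and self.pending[0][0] <= stable:
                ready.append(heapq.heappop(self.pending)[2])
            return ready   # safe to apply at the destination, in order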
CephFS
This is where it all started; let's get there.
Today
QA coverage and bug squashing continue
NFS and CIFS now largely complete and robust
Need
Multi-MDS
Directory fragmentation
Snapshots
QA investment
Amazing community effort
The larger ecosystem
Big data
When will we stop talking about MapReduce?
Why is big data built on such a lame storage model?
Move computation to the data
Evangelize RADOS classes
librados case studies and proof points
Build a general purpose compute and storage platform
The enterprise
How do we pay for all our toys?
Support legacy and transitional interfaces
iSCSI, NFS, pNFS, CIFS
VMware, Hyper-V
Identify the beachhead use-cases
Only takes one use-case to get in the door
Earn others later
Single platform shared storage resource
Bottom-up: earn respect of engineers and admins
Top-down: strong brand and compelling product
Why we can beat the old guard
It is hard to compete with free and open source software
Unbeatable value proposition
Ultimately a more efficient development model
It is hard to manufacture community
Strong foundational architecture
Native protocols, Linux kernel support
Unencumbered by legacy protocols like NFS
Move beyond traditional client/server model
Ongoing paradigm shift
Software-defined infrastructure, data center
Thank you, and Welcome!