Ceph: A decade in the making and still going strong


Sage Weil

Today the part of Sage Weil will be played by...


Research beginnings


UCSC research grant

Petascale object storage

DOE: LANL, LLNL, Sandia

Scalability, reliability, performance

HPC file system workloads

Scalable metadata management

First line of Ceph code

Summer internship at LLNL

High security national lab environment

Could write anything, as long as it was OSS

The rest of Ceph

RADOS: distributed object storage cluster (2005)

EBOFS: local object storage (2004/2006)

CRUSH: hashing for the real world (2005)

Paxos monitors: cluster consensus (2006)

Emphasis on consistent, reliable storage

Scale by pushing intelligence to the edges

A different but compelling architecture (a simplified placement sketch follows below)
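Since the defining idea above is that clients compute placement themselves, here is a minimal Python sketch of CRUSH-style placement using weighted rendezvous ("straw") hashing. It illustrates the concept only and is not the real CRUSH algorithm; the OSD names, weights, and replica count are hypothetical.

```python
# Illustration of CRUSH-style client-side placement (not the real CRUSH
# algorithm): every client derives the same ordered list of OSDs for an
# object from a shared cluster map, with no central lookup service.
import hashlib
import math

# Hypothetical cluster map: OSD name -> relative weight.
CLUSTER_MAP = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 0.5, "osd.3": 1.0}

def _straw(object_name, osd, weight):
    """Deterministic weighted 'straw' length for this (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}/{osd}".encode()).digest()
    u = (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)  # uniform in (0, 1)
    return -weight / math.log(u)

def place(object_name, replicas=3):
    """Return the `replicas` OSDs with the longest straws for this object."""
    ranked = sorted(CLUSTER_MAP,
                    key=lambda osd: _straw(object_name, osd, CLUSTER_MAP[osd]),
                    reverse=True)
    return ranked[:replicas]

print(place("rbd_data.1234"))  # same answer on every client, no directory service
```

Because placement is a pure function of the object name and the cluster map, adding or removing an OSD only remaps the objects that ranked it highly, which is the same property CRUSH aims for to keep data movement small.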

Industry black hole

Many large storage vendors

Proprietary solutions that don't scale well

Few open source alternatives (2006)

Very limited scale, or limited community and architecture (Lustre)

No enterprise feature sets (snapshots, quotas)

PhD grads all built interesting systems... and then went to work for NetApp, DDN, EMC, Veritas.

They want you, not your project

A different path?

Change the storage world with open source

Do what Linux did to Solaris, Irix, Ultrix, etc.

License

LGPL: share changes, okay to link to proprietary code

Avoid unfriendly practices

Dual licensing

Copyright assignment

Platform

Remember sourceforge.net?

Incubation


DreamHost!

Move back to LA, continue hacking

Hired a few developers

Pure development

No deliverables

Ambitious feature set

Native Linux kernel client (2007-)

Per-directory snapshots (2008)

Recursive accounting (2008)

Object classes (2009)

librados (2009; a short usage sketch follows this list)

radosgw (2009)

strong authentication (2009)

RBD: rados block device (2010)
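As a taste of what librados exposes, here is a minimal sketch using the python-rados binding: connect to a cluster, write an object into a pool, and read it back. The conffile path and the pool name "data" are assumptions about the local setup; RBD and radosgw are layered on this same object API.

```python
# Minimal librados sketch via the python-rados binding: store and fetch one
# object.  Assumes a reachable cluster, a readable ceph.conf, and an
# existing pool named "data".
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("data")
    try:
        ioctx.write_full("greeting", b"hello from librados")
        print(ioctx.read("greeting"))  # b'hello from librados'
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```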

The kernel client

ceph-fuse was limited, not very fast

Build native Linux kernel implementation

Began attending Linux file system developer events (LSF)

Early words of encouragement from ex-Lustre dev

Engage Linux fs developer community as peer

Initial merge attempts rejected by Linus

Not sufficient evidence of user demand

A few fans and would-be users chimed in...

Eventually merged for v2.6.34 (early 2010)

Part of a larger ecosystem

Ceph need not solve all problems as a monolithic stack

Replaced the EBOFS object file system with btrfs

Same design goals; avoid reinventing the wheel

Robust, supported, well-optimized

Kernel-level cache management

Copy-on-write, checksumming, other goodness

Contributed some early functionality

Cloning files

Async snapshots

Budding community

#ceph on irc.oftc.net, ceph-devel mailing list

Many interested users

A few developers

Many fans

Too unstable for any real deployments

Still mostly focused on right architecture and technical solutions

Road to product

DreamHost decides to build an S3-compatible object storage service with Ceph

Stability

Focus on core RADOS, RBD, radosgw

Paying back some technical debt

Build testing automation

Code review!

Expand engineering team

The reality

Growing incoming commercial interest

Early attempts from organizations large and small

Difficult to engage with a web hosting company

No means to support commercial deployments

Project needed a company to back it

Fund the engineering effort

Build and test a product

Support users

Orchestrated a spin-out from DreamHost in 2012

Inktank


Do it right

How do we build a strong open source company?

How do we build a strong open source community?

Models?

Red Hat, SUSE, Cloudera, MySQL, Canonical, ...

Initial funding from DreamHost, Mark Shuttleworth

Goals

A stable Ceph release for production deployment

DreamObjects

Lay foundation for widespread adoption

Platform support (Ubuntu, Red Hat, SUSE)

Documentation

Build and test infrastructure

Build a sales and support organization

Expand engineering organization

Branding

Early decision to engage a professional agency

Terms like

Brand core

Design system

Company vs Project

Inktank != Ceph

Establish a healthy relationship with the community

Aspirational messaging: The Future of Storage

Slick graphics

Broken PowerPoint template

Traction

Too many production deployments to count

We don't know about most of them!

Too many customers (for me) to count

Growing partner list

Lots of buzz

OpenStack

Quality

Increased adoption means increased demands on robust testing

Across multiple platforms

Include platforms we don't use

Upgrades

Rolling upgrades

Inter-version compatibility

Developer community

Significant external contributors

First-class feature contributions from external contributors

Non-Inktank participants in daily stand-ups

External access to build/test lab infrastructure

Common toolset

GitHub

Email (kernel.org)

IRC (oftc.net)

Linux distros

CDS: Ceph Developer Summit

Community process for building project roadmap

100% online

Google Hangouts

Wikis

Etherpad

First was in Spring 2013, fifth is in two weeks

Great feedback, growing participation

Indoctrinating our own developers into an open development model

And then...

s/Red Hat of Storage/Storage of Red Hat/

Calamari

Inktank's strategy was to package Ceph for the enterprise

Inktank Ceph Enterprise (ICE)

Ceph: a hardened, tested, validated version

Calamari: management layer and GUI (proprietary!)

Enterprise integrations: SNMP, Hyper-V, VMware

Support SLAs

Red Hat model is pure open source

Open sourced Calamari

The Present

Tiering

Client-side caches are great, but only buy so much.

Can we separate hot and cold data onto different storage devices?

Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool (see the setup sketch below)

Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding, NYI)

Very cold pools (efficient erasure coding, compression, OSD spin-down to save power) OR tape/public cloud

How do you identify what is hot and cold?

Common in enterprise solutions; not found in open source scale-out systems

Cache pools new in Firefly, better in Giant
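A hedged sketch of the setup referenced above: the Firefly-era `ceph osd tier` commands wire a fast pool in front of a cold pool, driven from Python here for consistency with the other examples. The pool names "hot" and "cold" are assumptions and both pools must already exist.

```python
# Sketch of attaching a cache pool ("hot") in front of a backing pool
# ("cold") using the ceph CLI; pool names and settings are placeholders.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

ceph("osd", "tier", "add", "cold", "hot")                   # make "hot" a tier of "cold"
ceph("osd", "tier", "cache-mode", "hot", "writeback")       # absorb writes in the cache
ceph("osd", "tier", "set-overlay", "cold", "hot")           # route client I/O via the cache
ceph("osd", "pool", "set", "hot", "hit_set_type", "bloom")  # track hotness for promotion/eviction
```

In writeback mode the cache tier absorbs reads and writes, and the OSDs use the hit sets to decide which objects to promote and which cold objects to flush and evict back to the backing pool.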

Erasure coding

Replication for redundancy is flexible and fast

For larger clusters, it can be expensive

We can trade recovery performance for storage

Erasure-coded data is hard to modify, but ideal for cold or read-only objects

Cold storage tiering

Will be used directly by radosgw

Scheme            Storage overhead   Repair traffic   MTTDL (days)
3x replication    3x                 1x               2.3 E10
RS (10, 4)        1.4x               10x              3.3 E13
LRC (10, 6, 5)    1.6x               5x               1.2 E15

Erasure coding (cont'd)

In Firefly

LRC in Giant

Intel ISA-L (optimized library) in Giant, maybe backported to Firefly

Talk of ARM optimized (NEON) jerasure
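A hedged sketch of creating an RS(10, 4)-style erasure-coded pool with the jerasure plugin on a Firefly-era cluster; the profile name, pool name, and PG counts are placeholders to size for a real cluster.

```python
# Sketch of defining an erasure-code profile (10 data + 4 coding chunks)
# and creating a pool that uses it; names and PG counts are placeholders.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

ceph("osd", "erasure-code-profile", "set", "ec-10-4",
     "k=10", "m=4", "plugin=jerasure")                 # ~1.4x overhead vs 3x replication
ceph("osd", "pool", "create", "ecpool", "128", "128",
     "erasure", "ec-10-4")                             # EC pool backed by that profile
```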

Async Replication in RADOS

Clinic project with Harvey Mudd

Group of students working on real world project

Reason about the bounds on clock drift so we can achieve point-in-time consistency across a distributed set of nodes

CephFS

Dogfooding for internal QA infrastructure

Learning lots

Many rough edges, but working quite well!

We want to hear from you!

The Future

CephFS

This is where it all started; let's get there

Today

QA coverage and bug squashing continues

NFS and CIFS now largely complete and robust

Multi-MDS stability continues to improve

Need

QA investment

Snapshot work

Amazing community effort

The larger ecosystem

Storage backends

Backends are pluggable

Recent work to use RocksDB everywhere LevelDB can be used (mon/osd); can easily plug in other key/value store libraries

Other possibilities include LMDB or NVMKV (from FusionIO)

Prototype Kinetic backend

Alternative OSD backends

KeyValueStore: put all data in a k/v db (Haomai @ UnitedStack)

KeyFileStore: initial plans (2nd gen?)

Some partners looking at backends tuned to their hardware

Governance

How do we strengthen the project community?

Acknowledge Sage's role as maintainer / BDL

Recognize project leads

RBD, RGW, RADOS, CephFS, Calamari, etc.

Formalize processes around CDS, community roadmap

Formal foundation?

Community build and test lab infrastructure

Build and test for broad range of OSs, distros, hardware

Technical roadmap

How do we reach new use-cases and users?

How do we better satisfy existing users?

How do we ensure Ceph can succeed in enough markets for business investment to thrive?

Enough breadth to expand and grow the community

Enough focus to do well

Performance

Lots of work with partners to improve performance

High-end flash backends: optimize hot paths to limit CPU usage, drive up IOPS

Improve threading, fine-grained locks

Low-power processors: run well on small ARM devices (including those new-fangled Ethernet drives)

Ethernet Drives

Multiple vendors are building 'ethernet drives'

Normal hard drives w/ small ARM host on board

Could run the OSD natively on the drive, completely removing the host from the deployment

Many different implementations; some vendors need help w/ open architecture and ecosystem concepts

Current devices are hard disks; no reason they couldn't also be flash-based, or hybrid

This is exactly what we were thinking when Ceph was originally designed!

Big data

Why is big data built on such a weak storage model?

Move computation to the data

Evangelize RADOS classes (see the client sketch after this list)

librados case studies and proof points

Build a general purpose compute and storage platform
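To make "move computation to the data" concrete, here is a hedged sketch of invoking a RADOS object class from a client: the OSD runs the class method next to the object instead of shipping the data back. It assumes a python-rados build that exposes Ioctx.execute() and the sample "hello" class that ships in the Ceph source tree; the pool and object names are illustrative.

```python
# Sketch of calling an OSD-side object class method; the computation runs
# on the OSD that stores the object, not on the client.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("data")            # pool name is an assumption
try:
    ioctx.write_full("myobject", b"")         # class methods need an existing object
    ret, out = ioctx.execute("myobject", "hello", "say_hello", b"")
    print(out)                                # reply produced on the OSD
finally:
    ioctx.close()
    cluster.shutdown()
```

Real-world classes do things like computing checksums, filtering, or maintaining secondary indexes server-side, which is the pattern radosgw itself relies on.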

The enterprise

How do we pay for all our toys?

Support legacy and transitional interfaces

iSCSI, NFS, pNFS, CIFS

VMware, Hyper-V

Identify the beachhead use-cases

Only takes one use-case to get in the door

Single platform shared storage resource

Bottom-up: earn respect of engineers and admins

Top-down: strong brand and compelling product

Why we can beat the old guard

It is hard to compete with free and open source software

Unbeatable value proposition

Ultimately a more efficient development model

It is hard to manufacture community

Strong foundational architecture

Native protocols, Linux kernel support

Unencumbered by legacy protocols like NFS

Move beyond traditional client/server model

Ongoing paradigm shift

Software-defined infrastructure, data center

Thanks!
