Quick-and-Easy Deployment of a Ceph Storage Cluster


Transcript of Quick-and-Easy Deployment of a Ceph Storage Cluster

Page 1: Quick-and-Easy Deployment of a Ceph Storage Cluster

Quick & Easy Deployment of a Ceph Storage Cluster with SUSE Enterprise Storage

Lars Marowsky-Brée, Distinguished Engineer, Architect Storage & HA

[email protected]

Page 2: Quick-and-Easy Deployment of a Ceph Storage Cluster


Agenda

• Introduction

• Setting the stage

• What is Ceph, and how does it work?

• Designing Ceph clusters

• Deploying Ceph

• Q & A

Page 3: Quick-and-Easy Deployment of a Ceph Storage Cluster

Software-defined storage market

Page 4: Quick-and-Easy Deployment of a Ceph Storage Cluster


From $1.5B in 2014 to a projected $2.8B in 2017, a 22.5% CAGR (IDC)

By 2018, open-source storage will gain 20% of the market share, up from less than 1% in 2013 (Gartner)

Page 5: Quick-and-Easy Deployment of a Ceph Storage Cluster


Enterprise Data Storage: Market Segments

Page 6: Quick-and-Easy Deployment of a Ceph Storage Cluster


Enterprise Data Capacity Utilization

• Tier 0 (Ultra High Performance): 1-3% of enterprise data

• Tier 1 (High-value, OLTP, Revenue Generating): 15-20%

• Tier 2 (Backup/Recovery, Reference Data, Bulk Data): 20-25%

• Tier 3 (Object, Archive, Compliance Archive, Long-term Retention): 50-60%

Source: Horison Information Strategies - Fred Moore

Page 7: Quick-and-Easy Deployment of a Ceph Storage Cluster

Enterprise Data Storage: Use Cases

Figure: use cases plotted along two axes, from capacity-optimized to performance-optimized and from low to high functionality. Use cases shown: object storage, archive storage, compliance archive, data backup, DR target, video/audio, big data, bulk storage, data analytics, OLTP, CRM/ERP, email, HPC, and VM-aware storage.

Page 8: Quick-and-Easy Deployment of a Ceph Storage Cluster


Ceph at SUSE

• 20+ years of success with Linux & enterprise storage
‒ Ceph as technology preview in SUSE Cloud 3

‒ General support with SUSE Cloud 4

‒ Dedicated storage product since February 2015

• Fast growing team, still hiring!

• Focus areas:
‒ Enterprise-readiness

‒ Ease of use

‒ Stabilization and integration

Page 9: Quick-and-Easy Deployment of a Ceph Storage Cluster

SUSE Enterprise Storage: Initial Target Market

Figure: the same use-case map as before, highlighting the initial target market for SUSE Enterprise Storage among the capacity-optimized use cases.

Page 10: Quick-and-Easy Deployment of a Ceph Storage Cluster

SUSE Enterprise Storage Pricing Model

SES Base Configuration – Priority Subscription
● SES and limited use of SUSE Linux Enterprise Server to provide:
‒ 4 SES storage OSD nodes (1-2 sockets)
‒ 3-5 SES MON nodes (customer selects redundancy level)
‒ 1 SES management node

SES Expansion Node – Priority Subscription

Page 11: Quick-and-Easy Deployment of a Ceph Storage Cluster

What is Ceph?

Page 12: Quick-and-Easy Deployment of a Ceph Storage Cluster


From 10,000 Meters

• Open Source Distributed Storage solution

• Most popular choice of distributed storage for OpenStack [1]

• Lots of goodies:
‒ Distributed Object Storage
‒ Redundancy
‒ Efficient Scale-Out
‒ Extensible
‒ Can be built on commodity hardware
‒ Lower operational cost

[1] http://www.openstack.org/blog/2013/11/openstack-user-survey-statistics-november-2013/

Page 13: Quick-and-Easy Deployment of a Ceph Storage Cluster


From 1,000 Meters

Unified Data Handling for 3 Purposes

• Object Storage (like Amazon S3)
‒ RESTful interface
‒ S3 and Swift APIs

• Block Device
‒ Block devices up to 16 EiB
‒ Thin provisioning
‒ Snapshots

• File System
‒ POSIX compliant
‒ Separate data and metadata
‒ For use e.g. with Hadoop

All three are layered on an autonomous, redundant storage cluster.

Page 14: Quick-and-Easy Deployment of a Ceph Storage Cluster


Component Names

• radosgw: Object Storage
• RBD: Block Device
• CephFS: File System
• librados: direct application access to RADOS
• RADOS: the common object store underneath all of the above

Page 15: Quick-and-Easy Deployment of a Ceph Storage Cluster


For a Moment, Zooming to Atom Level

Diagram: an OSD (Object Storage Daemon) runs on top of a file system (btrfs, xfs), which in turn sits on a physical disk.

• OSDs serve storage objects to clients
• OSDs peer with each other to perform replication and recovery

Page 16: Quick-and-Easy Deployment of a Ceph Storage Cluster


Put Several of These in One Node

Diagram: a single node hosting six OSDs, each with its own file system and underlying disk.

Page 17: Quick-and-Easy Deployment of a Ceph Storage Cluster


Mix In a Few Monitor Nodes

• Monitors are the brain cells of the cluster
‒ Cluster Membership
‒ Consensus for Distributed Decision Making

• Do not serve stored objects to clients

Page 18: Quick-and-Easy Deployment of a Ceph Storage Cluster


Voilà, a Small RADOS Cluster

Diagram: a set of OSD nodes plus three monitor nodes form a small RADOS cluster.

Page 19: Quick-and-Easy Deployment of a Ceph Storage Cluster


Application Access

Diagram: one application links against librados and talks to the cluster directly over a socket; a second application talks REST to radosgw, which in turn uses librados over a socket.
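For a quick hands-on feel for this path, the rados command-line tool (a thin wrapper around librados) can read and write objects directly; the pool, object, and file names below are just examples:

• rados -p mypool put rubberduck ./some-local-file   (write an object)
• rados -p mypool get rubberduck ./roundtrip-copy    (read it back)
• rados -p mypool ls                                 (list objects in the pool)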

Page 20: Quick-and-Easy Deployment of a Ceph Storage Cluster


RADOS Block Device

Diagram: a Linux host maps a RADOS Block Device via the in-kernel krbd driver (or librados) and talks directly to the cluster's OSD and monitor nodes.
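A minimal sketch of that flow from the shell, assuming a pool named rbd already exists (image name and size are arbitrary; rbd sizes here are given in MB):

• rbd create rbd/myimage --size 10240          (create a 10 GiB image)
• rbd map rbd/myimage                          (appears as a block device, e.g. /dev/rbd0)
• mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /mnt   (use it like any other disk)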

Page 21: Quick-and-Easy Deployment of a Ceph Storage Cluster


RADOS Block Device - VMs

Diagram: a VM running on KVM accesses its RADOS Block Device through qemu-rbd, which uses librados to talk to the cluster.

Page 22: Quick-and-Easy Deployment of a Ceph Storage Cluster


Several Ingredients

• Basic Idea
‒ Coarse-grained partitioning of storage supports policy-based mapping (don't put all copies of my data in one rack)
‒ Topology map and Rules allow clients to “compute” the exact location of any storage object
‒ CRUSH: deterministic, decentralized placement algorithm

• Three conceptual components
‒ Objects / images
‒ Pools
‒ Placement groups

Page 23: Quick-and-Easy Deployment of a Ceph Storage Cluster


Pools

• A pool is a logical container for storage objects

• A pool has a set of parameters
‒ a name

‒ a numerical ID (internal to RADOS)

‒ number of replicas

‒ number of placement groups

‒ placement rule set

‒ owner

• Pools support certain operations
‒ create/remove/read/write entire objects

‒ snapshot of the entire pool
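As a concrete sketch of how such a pool is created and its parameters adjusted (names and numbers are examples only):

• ceph osd pool create swimmingpool 8192   (pool name and number of placement groups)
• ceph osd pool set swimmingpool size 3    (number of replicas)
• ceph osd pool get swimmingpool pg_num    (read a parameter back)
• ceph osd lspools                         (list pools with their numerical IDs)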

Page 24: Quick-and-Easy Deployment of a Ceph Storage Cluster


Placement Groups

• Placement groups help balance data across OSDs

• Consider a pool named “swimmingpool”
‒ with a pool ID of 38 and 8192 placement groups (PGs)

• Consider object “rubberduck” in “swimmingpool”
‒ hash(“rubberduck”) % 8192 = 0xb0b
‒ The resulting PG is 38.b0b

• One PG typically spans several OSDs
‒ for balancing
‒ for replication

• One OSD typically serves many PGs
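On a running cluster, this object-to-PG-to-OSD mapping can be inspected directly (the PG ID follows the example above):

• ceph pg map 38.b0b       (show the OSDs currently up/acting for this PG)
• ceph pg dump pgs_brief   (overview of all PGs and the OSDs they map to)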

Page 25: Quick-and-Easy Deployment of a Ceph Storage Cluster


CRUSH

• CRUSH uses a map of all OSDs in your cluster
‒ includes physical topology, like row, rack, host

‒ includes rules describing which OSDs to consider for what type of pool/PG

• This map is maintained by the monitor nodes
‒ Monitor nodes use standard cluster algorithms for consensus building, etc.
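The map is easy to look at: the hierarchy can be printed directly, and the full map with its rules can be decompiled into readable text (the file names are arbitrary):

• ceph osd tree                               (show the OSD/host/rack hierarchy)
• ceph osd getcrushmap -o crushmap.bin        (fetch the compiled CRUSH map)
• crushtool -d crushmap.bin -o crushmap.txt   (decompile it into editable text)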

Page 26: Quick-and-Easy Deployment of a Ceph Storage Cluster


Diagram: locating the object swimmingpool/rubberduck.

Page 27: Quick-and-Easy Deployment of a Ceph Storage Cluster


CRUSH in Action: Writing

Diagram: the client computes that swimmingpool/rubberduck maps to PG 38.b0b and writes to one OSD, which then propagates the changes to the other replicas.
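That client-side computation can be reproduced with a single command, which prints the PG and the OSDs CRUSH selects for the object (names as in the running example):

• ceph osd map swimmingpool rubberduck   (prints the PG, e.g. 38.b0b, and its up/acting OSD set)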

Page 28: Quick-and-Easy Deployment of a Ceph Storage Cluster

Self-Healing

Page 29: Quick-and-Easy Deployment of a Ceph Storage Cluster


Self-Healing

Diagram: one of the OSDs serving PG 38.b0b dies; the monitors detect the dead OSD.

Page 30: Quick-and-Easy Deployment of a Ceph Storage Cluster


Self-Healing

Diagram: the monitors allocate another OSD to the PG and update the CRUSH map.

Page 31: Quick-and-Easy Deployment of a Ceph Storage Cluster


Self-Healing

Diagram: the monitors initiate recovery; the PG's data is replicated onto the newly allocated OSD.

Page 32: Quick-and-Easy Deployment of a Ceph Storage Cluster


Self-Healing

Diagram: future writes to swimmingpool/rubberduck update the new replica as well.
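On a test cluster this sequence is easy to observe by taking an OSD out and watching the cluster heal itself (the OSD ID is an example; avoid doing this casually in production):

• ceph osd out 2       (mark OSD 2 out; its PGs get remapped)
• ceph -w              (watch recovery and backfill progress live)
• ceph health detail   (list degraded or recovering PGs, if any)
• ceph osd in 2        (bring the OSD back in afterwards)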

Page 33: Quick-and-Easy Deployment of a Ceph Storage Cluster


Redundancy Choices

• Replication:
‒ n exact, full-size copies
‒ Potentially increased read performance due to striping
‒ More copies lower throughput and increase latency
‒ Increased cluster network utilization for writes
‒ Rebuilds can leverage multiple sources
‒ Significant capacity impact

• Erasure coding:
‒ Data split into k parts plus m redundancy codes
‒ Better space efficiency
‒ Higher CPU overhead
‒ Significant CPU and cluster network impact, especially during rebuild
‒ Cannot directly be used with block devices (see next slide)
‒ Plenty of algorithms to choose from
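A sketch of what the erasure-coding choice looks like in practice (the profile name, k/m values, and pool names are examples only):

• ceph osd erasure-code-profile set myprofile k=4 m=2       (4 data chunks plus 2 coding chunks)
• ceph osd pool create ecpool 1024 1024 erasure myprofile   (erasure-coded pool using that profile)
• ceph osd pool create replpool 1024 replicated             (a replicated pool, for comparison)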

Page 34: Quick-and-Easy Deployment of a Ceph Storage Cluster


Cache Tiering

• Multi-tier storage architecture:
‒ One pool acts as a transparent write-back overlay for another

‒ e.g., SSD 3-way replication over HDDs with erasure coding

‒ Additional configuration complexity and requires workload-specific tuning

‒ Also available: read-only mode (no write acceleration)

‒ Some downsides (not all features supported), memory consumption for HitSet

• A good way to combine the advantages of replication and erasure coding
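A minimal sketch of wiring such a tier together, assuming a replicated (e.g. SSD-backed) pool named cachepool in front of the erasure-coded ecpool from the previous example:

• ceph osd tier add ecpool cachepool               (attach cachepool as a cache tier of ecpool)
• ceph osd tier cache-mode cachepool writeback     (write-back mode; readonly is the other option)
• ceph osd tier set-overlay ecpool cachepool       (redirect client traffic through the cache)
• ceph osd pool set cachepool hit_set_type bloom   (HitSet tracking, the memory cost noted above)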

Page 35: Quick-and-Easy Deployment of a Ceph Storage Cluster

Software Defined Storage and Performance

Page 36: Quick-and-Easy Deployment of a Ceph Storage Cluster


Legacy Storage Arrays

• Limits:
‒ Tightly controlled environment
‒ Limited scalability
‒ Few options
‒ Only certain approved drives
‒ Constrained number of disk slots
‒ Few memory variations
‒ Only very few networking choices
‒ Typically fixed controller and CPU

• Benefits:
‒ Reasonably easy to understand
‒ Long-term experience and “gut instincts”
‒ Somewhat deterministic behavior and pricing

Page 37: Quick-and-Easy Deployment of a Ceph Storage Cluster


Software Defined Storage (SDS)

• Limits:
‒ ?

‒ (Essentially, all limits are merely bugs.)

• Benefits:
‒ Infinite scalability

‒ Infinite adaptability

‒ Infinite choices

‒ Infinite flexibility

Page 38: Quick-and-Easy Deployment of a Ceph Storage Cluster


Properties of an SDS System

• Throughput

• Latency

• IOPS

• Single client or aggregate

• Availability

• Reliability

• Safety

• Cost

• Adaptability

• Flexibility

• Capacity

• Density

Page 39: Quick-and-Easy Deployment of a Ceph Storage Cluster


Architecting an SDS system

• These goals often conflict:
‒ Availability versus Density

‒ IOPS versus Density

‒ Latency versus Throughput

‒ “Performance” during rebuild

‒ Everything versus Cost

• Many hardware options

• Software topology offers many configuration choices

• There is no one size fits all

Page 40: Quick-and-Easy Deployment of a Ceph Storage Cluster


Designing a Ceph cluster

• Understand your workload

• Make a best guess, based on the desirable properties and factors

• Build a 1-10% pilot / proof of concept
‒ Preferably using loaner hardware from a vendor to avoid early commitment

• Refine and tune this until desired performance is achieved

• Scale up from there, re-tuning as you go

Page 41: Quick-and-Easy Deployment of a Ceph Storage Cluster


Measuring Ceph performance

• rados bench
‒ Measures backend performance of the RADOS store

• rados load-gen
‒ Generates configurable load on the cluster

• ceph tell osd.N bench
‒ Benchmarks an individual OSD

• fio rbd backend
‒ Swiss army knife of IO benchmarking on Linux
‒ Can also compare in-kernel rbd with user-space librados

• rest-bench
‒ Measures S3/radosgw performance
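Two of these in concrete form, as rough sketches rather than tuned benchmark recipes (pool and image names are examples; the rbd engine must be available in your fio build):

• rados bench -p swimmingpool 60 write --no-cleanup   (60-second write benchmark against the pool)
• rados bench -p swimmingpool 60 seq                  (sequential reads of the objects just written)
• fio --name=rbd-randwrite --ioengine=rbd --clientname=admin --pool=rbd --rbdname=myimage --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based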

Page 42: Quick-and-Easy Deployment of a Ceph Storage Cluster

Deploying a Ceph cluster

Page 43: Quick-and-Easy Deployment of a Ceph Storage Cluster


Assumptions

• You have three nodes: node1, node2, node3
‒ These nodes will be both OSD and monitor (MON) nodes
‒ Each of these has /dev/vdb as a disk to be used for Ceph

• You have one additional node: admin
‒ The node from which you want to run Calamari and host the web UI

Page 44: Quick-and-Easy Deployment of a Ceph Storage Cluster


Prepare the nodes

• Set up NTP

• Create a ceph user on each node
‒ Set up sudo for the ceph user

• Enable sshd and allow all nodes to log in to each other
‒ ssh-copy-id

• Install SUSE Linux Enterprise Server 12 and the SUSE Enterprise Storage 1.0 products on each

• Apply all available online updates
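A rough sketch of those preparation steps on one node (run as root and repeat on every node; the service name and sudoers details may differ in your environment):

• useradd -m ceph && passwd ceph                                (create the ceph user)
• echo "ceph ALL = (root) NOPASSWD:ALL" > /etc/sudoers.d/ceph   (passwordless sudo for ceph)
• chmod 0440 /etc/sudoers.d/ceph
• systemctl enable ntpd && systemctl start ntpd                 (keep clocks in sync; monitors dislike clock skew)
• ssh-copy-id ceph@node1                                        (run as the ceph user; repeat for node2, node3, admin)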

Page 45: Quick-and-Easy Deployment of a Ceph Storage Cluster


Deploy ceph!

• su - ceph

• ceph-deploy new node1 node2 node3

• ceph-deploy mon create-initial

• ceph-deploy osd create node1:vdb node2:vdb node3:vdb

• Sorry, we’re done ;-)
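Before declaring victory, it is worth confirming that the cluster actually came up healthy:

• ceph -s         (overall status; look for HEALTH_OK and all three monitors in quorum)
• ceph osd tree   (all three OSDs should be up and in)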

Page 46: Quick-and-Easy Deployment of a Ceph Storage Cluster


Deploy the calamari web frontend

• On the admin node:

• zypper in calamari-clients

• calamari-ctl initialize

• Open http://admin.node/

• ceph-deploy calamari --master admin --connect node1 node2 node3

Page 47: Quick-and-Easy Deployment of a Ceph Storage Cluster


Calamari Status screenshot

Page 48: Quick-and-Easy Deployment of a Ceph Storage Cluster


Performance overview

Page 49: Quick-and-Easy Deployment of a Ceph Storage Cluster

In closing

Page 50: Quick-and-Easy Deployment of a Ceph Storage Cluster


Summary: Why Ceph?

• Can be scaled arbitrarily

‒ No central bottleneck “master” servers

• Low operational cost

‒ Automation

‒ Commodity hardware

• Can move data close to application

• Redundancy through data replication

‒ Self-healing

‒ No need for RAID

• APIs and cloud integration for self-service

‒ Software defined storage

Page 51: Quick-and-Easy Deployment of a Ceph Storage Cluster

Thank you.


Questions and Answers?

Page 52: Quick-and-Easy Deployment of a Ceph Storage Cluster

Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany

+49 911 740 53 0 (Worldwide)
www.suse.com

Join us on: www.opensuse.org


Page 53: Quick-and-Easy Deployment of a Ceph Storage Cluster

Unpublished Work of SUSE LLC. All Rights Reserved.

This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.