
Deployment Guide

SUSE Enterprise Storage 6


by Tomáš Bažant, Alexandra Settle, Liam Proven, and Sven Seeberg

Publication Date: 03/04/2021

SUSE LLC, 1800 South Novell Place, Provo, UT 84606, USA

https://documentation.suse.com

Copyright © 2021 SUSE LLC

Copyright © 2016, Red Hat, Inc., and contributors.

The text of and illustrations in this document are licensed under a Creative Commons Attribution-Share Alike 4.0 International ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode . In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. All other trademarks are the property of their respective owners.


For SUSE trademarks, see http://www.suse.com/company/legal/ . All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.


Contents

About This Guide x

I SUSE ENTERPRISE STORAGE 1

1 SUSE Enterprise Storage 6 and Ceph 2

1.1 Ceph Features 2

1.2 Core Components 3

RADOS 3 • CRUSH 4 • Ceph Nodes and Daemons 5

1.3 Storage Structure 6

Pool 6 • Placement Group 7 • Example 7

1.4 BlueStore 8

1.5 Additional Information 10

2 Hardware Requirements and Recommendations 11

2.1 Network Overview 11

Network Recommendations 12

2.2 Multiple Architecture Configurations 14

2.3 Hardware Configuration 15

Minimum Cluster Configuration 15 • Recommended Production Cluster Configuration 17

2.4 Object Storage Nodes 18

Minimum Requirements 18 • Minimum Disk Size 19 • Recommended Size for the BlueStore's WAL and DB Device 19 • Using SSD for OSD Journals 19 • Maximum Recommended Number of Disks 20

2.5 Monitor Nodes 20

2.6 Object Gateway Nodes 21


2.7 Metadata Server Nodes 21

2.8 Admin Node 21

2.9 iSCSI Nodes 22

2.10 SUSE Enterprise Storage 6 and Other SUSE Products 22

SUSE Manager 22

2.11 Naming Limitations 22

2.12 OSD and Monitor Sharing One Server 22

3 Admin Node HA Setup 24

3.1 Outline of the HA Cluster for Admin Node 24

3.2 Building a HA Cluster with Admin Node 25

4 User Privileges and Command Prompts 27

4.1 Salt/DeepSea Related Commands 27

4.2 Ceph Related Commands 27

4.3 General Linux Commands 28

4.4 Additional Information 28

II CLUSTER DEPLOYMENT AND UPGRADE 29

5 Deploying with DeepSea/Salt 30

5.1 Read the Release Notes 30

5.2 Introduction to DeepSea 31

Organization and Important Locations 32 • Targeting the Minions 33

5.3 Cluster Deployment 35

5.4 DeepSea CLI 45

DeepSea CLI: Monitor Mode 45 • DeepSea CLI: Stand-alone Mode 46


5.5 Configuration and Customization 48

The policy.cfg File 48 • DriveGroups 53 • Adjusting ceph.conf with Custom Settings 63

6 Upgrading from Previous Releases 64

6.1 General Considerations 64

6.2 Steps to Take before Upgrading the First Node 65

Read the Release Notes 65 • Verify Your Password 65 • Verify the Previous Upgrade 65 • Upgrade Old RBD Kernel Clients 67 • Adjust AppArmor 67 • Verify MDS Names 67 • Consolidate Scrub-related Configuration 68 • Back Up Cluster Data 69 • Migrate from ntpd to chronyd 69 • Patch Cluster Prior to Upgrade 71 • Verify the Current Environment 73 • Check the Cluster's State 74 • Migrate OSDs to BlueStore 75

6.3 Order in Which Nodes Must Be Upgraded 77

6.4 Offline Upgrade of CTDB Clusters 77

6.5 Per-Node Upgrade Instructions 78

Manual Node Upgrade Using the Installer DVD 79 • Node Upgrade Using the SUSE Distribution Migration System 81

6.6 Upgrade the Admin Node 83

6.7 Upgrade Ceph Monitor/Ceph Manager Nodes 84

6.8 Upgrade Metadata Servers 84

6.9 Upgrade Ceph OSDs 86

6.10 Upgrade Gateway Nodes 89

6.11 Steps to Take after the Last Node Has Been Upgraded 91

Update Ceph Monitor Setting 91 • Enable the Telemetry Module 91

6.12 Update policy.cfg and Deploy Ceph Dashboard Using DeepSea 92

6.13 Migration from Profile-based Deployments to DriveGroups 94

Analyze the Current Layout 95 • Create DriveGroups Matching the Current Layout 95 • OSD Deployment 96 • More Complex Setups 96


7 Customizing the Default Configuration 98

7.1 Using Customized Configuration Files 98

Disabling a Deployment Step 98 • Replacing a Deployment Step 99 • Modifying a Deployment Step 100 • Modifying a Deployment Stage 101 • Updates and Reboots during Stage 0 103

7.2 Modifying Discovered Configuration 104

Enabling IPv6 for Ceph Cluster Deployment 106

III INSTALLATION OF ADDITIONAL SERVICES 108

8 Installation of Services to Access your Data 109

9 Ceph Object Gateway 110

9.1 Object Gateway Manual Installation 110

Object Gateway Configuration 111

10 Installation of iSCSI Gateway 117

10.1 iSCSI Block Storage 117

The Linux Kernel iSCSI Target 118 • iSCSI Initiators 118

10.2 General Information about ceph-iscsi 119

10.3 Deployment Considerations 120

10.4 Installation and Configuration 121

Deploy the iSCSI Gateway to a Ceph Cluster 121 • Create RBD Images 121 • Export RBD Images via iSCSI 122 • Authentication and Access Control 123 • Advanced Settings 125

10.5 Exporting RADOS Block Device Images Using tcmu-runner 128

11 Installation of CephFS 130

11.1 Supported CephFS Scenarios and Guidance 130

11.2 Ceph Metadata Server 131

Adding and Removing a Metadata Server 131 • Configuring a Metadata Server 131


11.3 CephFS 137

Creating CephFS 137 • MDS Cluster Size 138 • MDS Cluster and Updates 139 • File Layouts 140

12 Installation of NFS Ganesha 145

12.1 Preparation 145

General Information 145 • Summary of Requirements 146

12.2 Example Installation 146

12.3 High Availability Active-Passive Configuration 147

Basic Installation 147 • Clean Up Resources 150 • Setting Up Ping Resource 150 • Setting Up PortBlock Resource 151 • NFS Ganesha HA and DeepSea 153

12.4 Active-Active Configuration 154

Prerequisites 154 • Configure NFS Ganesha 155 • Populate the Cluster Grace Database 156 • Restart NFS Ganesha Services 157 • Conclusion 157

12.5 More Information 157

IV CLUSTER DEPLOYMENT ON TOP OF SUSE CAAS PLATFORM 4 (TECHNOLOGY PREVIEW) 158

13 SUSE Enterprise Storage 6 on Top of SUSE CaaS Platform 4 Kubernetes Cluster 159

13.1 Considerations 159

13.2 Prerequisites 159

13.3 Get Rook Manifests 160

13.4 Installation 160

Configuration 160 • Create the Rook Operator 162 • Create the Ceph Cluster 162

13.5 Using Rook as Storage for Kubernetes Workload 163

13.6 Uninstalling Rook 164


A Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases 165

Glossary 175

B Documentation Updates 178

B.1 Maintenance update of SUSE Enterprise Storage 6 documentation 178

B.2 June 2019 (Release of SUSE Enterprise Storage 6) 179


About This Guide

SUSE Enterprise Storage 6 is an extension to SUSE Linux Enterprise Server 15 SP1. It combines the capabilities of the Ceph (http://ceph.com/ ) storage project with the enterprise engineering and support of SUSE. SUSE Enterprise Storage 6 provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases using commodity hardware platforms.

This guide helps you understand the concepts of SUSE Enterprise Storage 6, with the main focus on managing and administering the Ceph infrastructure. It also demonstrates how to use Ceph with other related solutions, such as OpenStack or KVM.

Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet.

For an overview of the documentation available for your product and the latest documentation updates, refer to https://documentation.suse.com .

1 Available Documentation

The following manuals are available for this product:

Book “Administration Guide”

The guide describes various administration tasks that are typically performed after the installation. The guide also introduces steps to integrate Ceph with virtualization solutions such as libvirt , Xen, or KVM, and ways to access objects stored in the cluster via iSCSI and RADOS gateways.

Deployment Guide

Guides you through the installation steps of the Ceph cluster and all services related to Ceph. The guide also illustrates a basic Ceph cluster structure and provides you with related terminology.

HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual . Find the latest documentation updates at https://documentation.suse.com where you can download the manuals for your product in multiple formats.


2 Feedback

Several feedback channels are available:

Bugs and Enhancement Requests

For services and support options available for your product, refer to http://www.suse.com/support/ . To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select My Support Service Request.

User Comments

We want to hear your comments and suggestions for this manual and the other documentation included with this product. If you have questions, suggestions, or corrections, contact [email protected], or click the Report Documentation Bug link beside each chapter or section heading.

Mail

For feedback on the documentation of this product, you can also send a mail to [email protected] . Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).

3 Documentation Conventions

The following typographical conventions are used in this manual:

/etc/passwd : directory names and file names

placeholder : replace placeholder with the actual value

PATH : the environment variable PATH

ls , --help : commands, options, and parameters

user : users or groups

Alt , Alt – F1 : a key to press or a key combination; keys are shown in uppercase as on a keyboard


File, File Save As: menu items, buttons

Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.

4 About the Making of This Manual

This book is written in GeekoDoc, a subset of DocBook (see http://www.docbook.org ). The XML source files were validated by xmllint , processed by xsltproc , and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps . The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/ .

5 Ceph Contributors

The Ceph project and its documentation are the result of the work of hundreds of contributors and organizations. See https://ceph.com/contributors/ for more details.


I SUSE Enterprise Storage

1 SUSE Enterprise Storage 6 and Ceph 2

2 Hardware Requirements and Recommendations 11

3 Admin Node HA Setup 24

4 User Privileges and Command Prompts 27


1 SUSE Enterprise Storage 6 and Ceph

SUSE Enterprise Storage 6 is a distributed storage system designed for scalability, reliability, and performance, based on Ceph technology. A Ceph cluster can be run on commodity servers in a common network like Ethernet. The cluster scales up well to thousands of servers (later on referred to as nodes) and into the petabyte range. As opposed to conventional systems which have allocation tables to store and fetch data, Ceph uses a deterministic algorithm to allocate storage for data and has no centralized information structure. Ceph assumes that in storage clusters the addition or removal of hardware is the rule, not the exception. The Ceph cluster automates management tasks such as data distribution and redistribution, data replication, failure detection and recovery. Ceph is both self-healing and self-managing, which results in a reduction of administrative and budget overhead.

This chapter provides a high level overview of SUSE Enterprise Storage 6 and briefly describes the most important components.

Tip
Since SUSE Enterprise Storage 5, the only cluster deployment method is DeepSea. Refer to Chapter 5, Deploying with DeepSea/Salt for details about the deployment process.

1.1 Ceph Features

The Ceph environment has the following features:

Scalability

Ceph can scale to thousands of nodes and manage storage in the range of petabytes.

Commodity Hardware

No special hardware is required to run a Ceph cluster. For details, see Chapter 2, Hardware Requirements and Recommendations.

Self-managing

The Ceph cluster is self-managing. When nodes are added, removed or fail, the cluster automatically redistributes the data. It is also aware of overloaded disks.

No Single Point of Failure


No node in a cluster stores important information alone. The number of redundancies can be configured.

Open Source Software

Ceph is an open source software solution and independent of specific hardware or vendors.

1.2 Core Components

To make full use of Ceph's power, it is necessary to understand some of the basic components and concepts. This section introduces some parts of Ceph that are often referenced in other chapters.

1.2.1 RADOS

The basic component of Ceph is called RADOS (Reliable Autonomic Distributed Object Store). It is responsible for managing the data stored in the cluster. Data in Ceph is usually stored as objects. Each object consists of an identifier and the data.

RADOS provides the following access methods to the stored objects that cover many use cases:

Object Gateway

Object Gateway is an HTTP REST gateway for the RADOS object store. It enables direct access to objects stored in the Ceph cluster.

RADOS Block Device

RADOS Block Devices (RBD) can be accessed like any other block device. These can be used, for example, in combination with libvirt for virtualization purposes.

CephFS

The Ceph File System is a POSIX-compliant file system.

librados

librados is a library that can be used with many programming languages to create an application capable of directly interacting with the storage cluster.

librados is used by Object Gateway and RBD, while CephFS directly interfaces with RADOS (see Figure 1.1, “Interfaces to the Ceph Object Store”).


FIGURE 1.1: INTERFACES TO THE CEPH OBJECT STORE

1.2.2 CRUSH

At the core of a Ceph cluster is the CRUSH algorithm. CRUSH is the acronym for Controlled Replication Under Scalable Hashing. CRUSH is a function that handles the storage allocation and needs comparably few parameters. That means only a small amount of information is necessary to calculate the storage position of an object. The parameters are a current map of the cluster including the health state, some administrator-defined placement rules, and the name of the object that needs to be stored or retrieved. With this information, all nodes in the Ceph cluster are able to calculate where an object and its replicas are stored. This makes writing or reading data very efficient. CRUSH tries to evenly distribute data over all nodes in the cluster.

The CRUSH map contains all storage nodes and administrator-defined placement rules for storing objects in the cluster. It defines a hierarchical structure that usually corresponds to the physical structure of the cluster. For example, the data-containing disks are in hosts, hosts are in racks, racks in rows, and rows in data centers. This structure can be used to define failure domains. Ceph then ensures that replications are stored on different branches of a specific failure domain.

If the failure domain is set to rack, replications of objects are distributed over different racks. This can mitigate outages caused by a failed switch in a rack. If one power distribution unit supplies a row of racks, the failure domain can be set to row. When the power distribution unit fails, the replicated data is still available on other rows.
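As a hedged illustration of this concept (the rule name and pool name are assumptions made for this example), a rack-level failure domain can be expressed as a replicated CRUSH rule and assigned to a pool:

cephadm@adm > ceph osd crush rule create-replicated replicated_rack default rack
cephadm@adm > ceph osd pool set mypool crush_rule replicated_rack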


1.2.3 Ceph Nodes and Daemons

In Ceph, nodes are servers working for the cluster. They can run several different types of daemons. We recommend running only one type of daemon on each node, except for Ceph Manager daemons, which can be collocated with Ceph Monitors. Each cluster requires at least Ceph Monitor, Ceph Manager, and Ceph OSD daemons:

Admin Node

Admin Node is a Ceph cluster node where the Salt master service is running. The Admin Node is a central point of the Ceph cluster because it manages the rest of the cluster nodes by querying and instructing their Salt minion services.

Ceph Monitor

Ceph Monitor (often abbreviated as MON) nodes maintain information about the cluster health state, a map of all nodes, and data distribution rules (see Section 1.2.2, “CRUSH”). If failures or conflicts occur, the Ceph Monitor nodes in the cluster decide by majority which information is correct. To form a qualified majority, it is recommended to have an odd number of Ceph Monitor nodes, and at least three of them.
If more than one site is used, the Ceph Monitor nodes should be distributed over an odd number of sites. The number of Ceph Monitor nodes per site should be such that more than 50% of the Ceph Monitor nodes remain functional if one site fails.

Ceph Manager

The Ceph Manager collects the state information from the whole cluster. The Ceph Manager daemon runs alongside the monitor daemons. It provides additional monitoring, and interfaces with external monitoring and management systems. It includes other services as well, for example the Ceph Dashboard Web UI. The Ceph Dashboard Web UI runs on the same node as the Ceph Manager.
The Ceph Manager requires no additional configuration, beyond ensuring it is running. You can deploy it as a separate role using DeepSea.

Ceph OSD

A Ceph OSD is a daemon handling Object Storage Devices, which are physical or logical storage units (hard disks or partitions). Object Storage Devices can be physical disks/partitions or logical volumes. The daemon additionally takes care of data replication and rebalancing in case of added or removed nodes.
Ceph OSD daemons communicate with monitor daemons and provide them with the state of the other OSD daemons.


To use CephFS, Object Gateway, NFS Ganesha, or iSCSI Gateway, additional nodes are required:

Metadata Server (MDS)

The Metadata Servers store metadata for the CephFS. By using an MDS you can execute basic file system commands such as ls without overloading the cluster.

Object Gateway

The Object Gateway is an HTTP REST gateway for the RADOS object store. It is compatible with OpenStack Swift and Amazon S3 and has its own user management.

NFS Ganesha

NFS Ganesha provides NFS access to either the Object Gateway or the CephFS. It runs in user space instead of kernel space and directly interacts with the Object Gateway or CephFS.

iSCSI Gateway

iSCSI is a storage network protocol that allows clients to send SCSI commands to SCSI storage devices (targets) on remote servers.

Samba Gateway

The Samba Gateway provides Samba access to data stored on CephFS.

1.3 Storage Structure

1.3.1 Pool

Objects that are stored in a Ceph cluster are put into pools. Pools represent logical partitions of the cluster to the outside world. For each pool a set of rules can be defined, for example, how many replications of each object must exist. The standard configuration of pools is called replicated pool.

Pools usually contain objects but can also be configured to act similar to a RAID 5. In this configuration, objects are stored in chunks along with additional coding chunks. The coding chunks contain the redundant information. The number of data and coding chunks can be defined by the administrator. In this configuration, pools are referred to as erasure coded pools.
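For illustration only (the profile name, pool name, placement group count, and k/m values are assumptions chosen for this sketch), an erasure coded pool with two data chunks and one coding chunk could be created as follows:

cephadm@adm > ceph osd erasure-code-profile set ec-2-1 k=2 m=1
cephadm@adm > ceph osd pool create ec_pool 32 32 erasure ec-2-1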


1.3.2 Placement Group

Placement Groups (PGs) are used for the distribution of data within a pool. When creating a pool, a certain number of placement groups is set. The placement groups are used internally to group objects and are an important factor for the performance of a Ceph cluster. The PG for an object is determined by the object's name.

1.3.3 Example

This section provides a simplified example of how Ceph manages data (see Figure 1.2, “Small Scale Ceph Example”). This example does not represent a recommended configuration for a Ceph cluster. The hardware setup consists of three storage nodes or Ceph OSDs ( Host 1 , Host 2 , Host 3 ). Each node has three hard disks which are used as OSDs ( osd.1 to osd.9 ). The Ceph Monitor nodes are neglected in this example.

Note: Difference between Ceph OSD and OSD
While Ceph OSD or Ceph OSD daemon refers to a daemon that is run on a node, the word OSD refers to the logical disk that the daemon interacts with.

The cluster has two pools, Pool A and Pool B . While Pool A replicates objects only two times, resilience for Pool B is more important and it has three replications for each object.

When an application puts an object into a pool, for example via the REST API, a Placement Group ( PG1 to PG4 ) is selected based on the pool and the object name. The CRUSH algorithm then calculates on which OSDs the object is stored, based on the Placement Group that contains the object.

In this example the failure domain is set to host. This ensures that replications of objects are stored on different hosts. Depending on the replication level set for a pool, the object is stored on two or three OSDs that are used by the Placement Group.

An application that writes an object only interacts with one Ceph OSD, the primary Ceph OSD. The primary Ceph OSD takes care of replication and confirms the completion of the write process after all other OSDs have stored the object.

If osd.5 fails, all objects in PG1 are still available on osd.1 . As soon as the cluster recognizes that an OSD has failed, another OSD takes over. In this example osd.4 is used as a replacement for osd.5 . The objects stored on osd.1 are then replicated to osd.4 to restore the replication level.


FIGURE 1.2: SMALL SCALE CEPH EXAMPLE

If a new node with new OSDs is added to the cluster, the cluster map changes. The CRUSH function then returns different locations for objects. Objects that receive new locations will be relocated. This process results in a balanced usage of all OSDs.

1.4 BlueStore

BlueStore is the default storage back-end for Ceph since SUSE Enterprise Storage 5. It has better performance than FileStore, full data check-summing, and built-in compression.

BlueStore manages either one, two, or three storage devices. In the simplest case, BlueStore consumes a single primary storage device. The storage device is normally partitioned into two parts:

1. A small partition named BlueFS that implements file system-like functionalities required by RocksDB.

2. A large partition occupying the rest of the device. It is managed directly by BlueStore and contains all of the actual data. This primary device is normally identified by a block symbolic link in the data directory.


It is also possible to deploy BlueStore across two additional devices:

A WAL device can be used for BlueStore's internal journal or write-ahead log. It is identified by the block.wal symbolic link in the data directory. It is only useful to use a separate WAL device if the device is faster than the primary device or the DB device, for example when:

The WAL device is an NVMe, and the DB device is an SSD, and the data device is either SSD or HDD.

Both the WAL and DB devices are separate SSDs, and the data device is an SSD or HDD.

A DB device can be used for storing BlueStore's internal metadata. BlueStore (or rather, the embedded RocksDB) will put as much metadata as it can on the DB device to improve performance. Again, it is only helpful to provision a shared DB device if it is faster than the primary device.
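In SUSE Enterprise Storage 6, OSDs are normally laid out via DeepSea DriveGroups (see Section 5.5.2, “DriveGroups”), but the following manual ceph-volume sketch illustrates the three device roles; the device names are assumptions for this example only:

root # ceph-volume lvm create --bluestore --data /dev/sdb \
 --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2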

Tip: Plan for the DB Size
Plan thoroughly to ensure sufficient size of the DB device. If the DB device fills up, metadata will spill over to the primary device, which badly degrades the OSD's performance.

You can check if a WAL/DB partition is getting full and spilling over with the ceph daemon osd.ID perf dump command. The slow_used_bytes value shows the amount of data being spilled out:

cephadm@adm > ceph daemon osd.ID perf dump | jq '.bluefs'
"db_total_bytes": 1073741824,
"db_used_bytes": 33554432,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 554432,
"slow_used_bytes": 554432,


1.5 Additional Information

Ceph as a community project has its own extensive online documentation. For topics not found in this manual, refer to http://docs.ceph.com/docs/master/ .

The original publication CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data by S.A. Weil, S.A. Brandt, E.L. Miller, C. Maltzahn provides helpful insight into the inner workings of Ceph. Especially when deploying large scale clusters, it is recommended reading. The publication can be found at http://www.ssrc.ucsc.edu/papers/weil-sc06.pdf .

SUSE Enterprise Storage can be used with non-SUSE OpenStack distributions. The Ceph clients need to be at a level that is compatible with SUSE Enterprise Storage.

Note
SUSE supports the server component of the Ceph deployment and the client is supported by the OpenStack distribution vendor.


2 Hardware Requirements and Recommendations

The hardware requirements of Ceph are heavily dependent on the IO workload. The following hardware requirements and recommendations should be considered as a starting point for detailed planning.

In general, the recommendations given in this section are on a per-process basis. If several processes are located on the same machine, the CPU, RAM, disk and network requirements need to be added up.

2.1 Network Overview

Ceph has several logical networks:

A trusted internal network, the back-end network, called the cluster network.

A public client network, called the public network.

Optional client networks for gateways.

The trusted internal network is the back-end network between the OSD nodes for replication, rebalancing and recovery. Ideally, this network provides twice the bandwidth of the public network with default 3-way replication, since the primary OSD sends two copies to other OSDs via this network. The public network connects clients and gateways on one side with monitors, managers, MDS nodes, and OSD nodes on the other. It is also used by monitors, managers, and MDS nodes to talk with OSD nodes.


FIGURE 2.1: NETWORK OVERVIEW

2.1.1 Network Recommendations

For the Ceph network environment, we recommend two 25 GbE (or faster) network interfaces bonded using 802.3ad (LACP). The use of two network interfaces provides aggregation and fault-tolerance. The bond should then be used to provide two VLAN interfaces, one for the public network, and the second for the cluster network. Details on bonding the interfaces can be found in https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-network.html#sec-network-iface-bonding .
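The resulting layout is then reflected in ceph.conf . A minimal sketch, assuming illustrative subnets for the two VLAN interfaces (the addresses are examples, not recommendations):

[global]
public network = 192.168.10.0/24
cluster network = 192.168.20.0/24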

Fault tolerance can be enhanced through isolating the components into failure domains. To improve fault tolerance of the network, bonding one interface from two separate Network Interface Cards (NIC) offers protection against failure of a single NIC. Similarly, creating a bond across two switches protects against failure of a switch. We recommend consulting with the network equipment vendor in order to architect the level of fault tolerance required.

Important: Administration Network not Supported
Additional administration network setup, enabling for example separate SSH, Salt, or DNS networking, is neither tested nor supported.


Tip: Nodes Configured via DHCP
If your storage nodes are configured via DHCP, the default timeouts may not be sufficient for the network to be configured correctly before the various Ceph daemons start. If this happens, the Ceph MONs and OSDs will not start correctly (running systemctl status ceph\* will result in "unable to bind" errors). To avoid this issue, we recommend increasing the DHCP client timeout to at least 30 seconds on each node in your storage cluster. This can be done by changing the following settings on each node:

In /etc/sysconfig/network/dhcp , set

DHCLIENT_WAIT_AT_BOOT="30"

In /etc/sysconfig/network/config , set

WAIT_FOR_INTERFACES="60"

2.1.1.1 Adding a Private Network to a Running Cluster

If you do not specify a cluster network during Ceph deployment, it assumes a single public network environment. While Ceph operates fine with a public network, its performance and security improve when you set a second private cluster network. To support two networks, each Ceph node needs to have at least two network cards.

You need to apply the following changes to each Ceph node. It is relatively quick to do for a small cluster, but can be very time consuming if you have a cluster consisting of hundreds or thousands of nodes.

1. Stop Ceph related services on each cluster node.
Add a line to /etc/ceph/ceph.conf to define the cluster network, for example:

cluster network = 10.0.0.0/24

If you need to specifically assign static IP addresses or override cluster network settings, you can do so with the optional cluster addr .
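For example, a per-daemon override might look like the following sketch (the OSD ID and address are assumptions for illustration):

[osd.0]
cluster addr = 10.0.0.11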

2. Check that the private cluster network works as expected on the OS level.

3. Start Ceph related services on each cluster node.


root # systemctl start ceph.target

2.1.1.2 Monitor Nodes on Different Subnets

If the monitor nodes are on multiple subnets, for example they are located in different rooms and served by different switches, you need to adjust the ceph.conf file accordingly. For example, if the nodes have IP addresses 192.168.123.12, 1.2.3.4, and 242.12.33.12, add the following lines to their global section:

[global]
[...]
mon host = 192.168.123.12, 1.2.3.4, 242.12.33.12
mon initial members = MON1, MON2, MON3
[...]

Additionally, if you need to specify a per-monitor public address or network, you need to add a [mon.X] section for each monitor:

[mon.MON1]
public network = 192.168.123.0/24

[mon.MON2]
public network = 1.2.3.0/24

[mon.MON3]
public network = 242.12.33.12/0

2.2 Multiple Architecture Configurations

SUSE Enterprise Storage supports both x86 and Arm architectures. When considering each architecture, it is important to note that from a cores per OSD, frequency, and RAM perspective, there is no real difference between CPU architectures for sizing.

As with smaller x86 processors (non-server), lower-performance Arm-based cores may not provide an optimal experience, especially when used for erasure coded pools.

Note
Throughout the documentation, SYSTEM-ARCH is used in place of x86 or Arm.


2.3 Hardware Configuration

For the best product experience, we recommend starting with the recommended cluster configuration. For a test cluster or a cluster with lower performance requirements, we document a minimal supported cluster configuration.

2.3.1 Minimum Cluster Configuration

A minimal product cluster configuration consists of:

At least four physical nodes (OSD nodes) with co-location of services

Dual-10 Gb Ethernet as a bonded network

A separate Admin Node (can be virtualized on an external node)

A detailed configuration is:

Separate Admin Node with 4 GB RAM, four cores, 1 TB storage capacity. This is typically the Salt master node. Ceph services and gateways, such as Ceph Monitor, Metadata Server, Ceph OSD, Object Gateway, or NFS Ganesha, are not supported on the Admin Node as it needs to orchestrate the cluster update and upgrade processes independently.

At least four physical OSD nodes, with eight OSD disks each; see Section 2.4.1, “Minimum Requirements” for requirements.
The total capacity of the cluster should be sized so that even with one node unavailable, the total used capacity (including redundancy) does not exceed 80%.

Three Ceph Monitor instances. Monitors need to be run from SSD/NVMe storage, not HDDs, for latency reasons.

Monitors, Metadata Server, and gateways can be co-located on the OSD nodes; see Section 2.12, “OSD and Monitor Sharing One Server” for monitor co-location. If you co-locate services, the memory and CPU requirements need to be added up.

iSCSI Gateway, Object Gateway, and Metadata Server require at least incremental 4 GB RAM and four cores.

If you are using CephFS, S3/Swift, or iSCSI, at least two instances of the respective roles (Metadata Server, Object Gateway, iSCSI) are required for redundancy and availability.


The nodes are to be dedicated to SUSE Enterprise Storage and must not be used for any other physical, containerized, or virtualized workload.

If any of the gateways (iSCSI, Object Gateway, NFS Ganesha, Metadata Server, ...) are deployed within VMs, these VMs must not be hosted on the physical machines serving other cluster roles. (This is unnecessary, as they are supported as collocated services.)

When deploying services as VMs on hypervisors outside the core physical cluster, failure domains must be respected to ensure redundancy.
For example, do not deploy multiple roles of the same type on the same hypervisor, such as multiple MON or MDS instances.

When deploying inside VMs, it is particularly crucial to ensure that the nodes have strong network connectivity and properly working time synchronization.

The hypervisor nodes must be adequately sized to avoid interference by other workloads consuming CPU, RAM, network, and storage resources.

FIGURE 2.2: MINIMUM CLUSTER CONFIGURATION


2.3.2 Recommended Production Cluster Configuration

Once you grow your cluster, we recommend relocating monitors, Metadata Servers, and gateways to separate nodes to ensure better fault tolerance.

Seven Object Storage Nodes

No single node exceeds ~15% of total storage.

The total capacity of the cluster should be sized so that even with one node unavailable, the total used capacity (including redundancy) does not exceed 80%.

25 Gb Ethernet or better, bonded, for the internal cluster and the external public network each.

56+ OSDs per storage cluster.

See Section 2.4.1, “Minimum Requirements” for further recommendation.

Dedicated physical infrastructure nodes.

Three Ceph Monitor nodes: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk. See Section 2.5, “Monitor Nodes” for further recommendation.

Object Gateway nodes: 32 GB RAM, 8 core processor, RAID 1 SSDs for disk. See Section 2.6, “Object Gateway Nodes” for further recommendation.

iSCSI Gateway nodes: 16 GB RAM, 6-8 core processor, RAID 1 SSDs for disk. See Section 2.9, “iSCSI Nodes” for further recommendation.

Metadata Server nodes (one active/one hot standby): 32 GB RAM, 8 core processor, RAID 1 SSDs for disk. See Section 2.7, “Metadata Server Nodes” for further recommendation.

One SES Admin Node: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk.


2.4 Object Storage Nodes

2.4.1 Minimum Requirements

The following CPU recommendations account for devices independent of usage by Ceph:

1x 2GHz CPU Thread per spinner.

2x 2GHz CPU Thread per SSD.

4x 2GHz CPU Thread per NVMe.

Separate 10 GbE networks (public/client and internal), required 4x 10 GbE, recommended 2x 25 GbE.

Total RAM required = number of OSDs x (1 GB + osd_memory_target ) + 16 GB
The default for osd_memory_target is 4 GB. Refer to Book “Administration Guide”, Chapter 25 “Ceph Cluster Configuration”, Section 25.2.1 “Automatic Cache Sizing” for more details on osd_memory_target . A worked RAM example follows this list.

OSD disks in JBOD configurations or individual RAID-0 configurations.

OSD journal can reside on OSD disk.

OSD disks should be exclusively used by SUSE Enterprise Storage.

Dedicated disk and SSD for the operating system, preferably in a RAID 1 configuration.

Allocate at least an additional 4 GB of RAM if this OSD host will host part of a cache pool used for cache tiering.

Ceph Monitors, gateway and Metadata Servers can reside on Object Storage Nodes.

For disk performance reasons, OSD nodes are bare metal nodes. No other workloads should run on an OSD node unless it is a minimal setup of Ceph Monitors and Ceph Managers.

SSDs for Journal with 6:1 ratio SSD journal to OSD.
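As a worked example of the RAM formula above (the node size is an assumption for illustration, not a sizing recommendation): a node with 8 OSDs and the default osd_memory_target of 4 GB needs at least 8 x (1 GB + 4 GB) + 16 GB = 56 GB of RAM.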


2.4.2 Minimum Disk Size

There are two types of disk space needed to run on OSD: the space for the disk journal (for FileStore) or WAL/DB device (for BlueStore), and the primary space for the stored data. The minimum (and default) value for the journal/WAL/DB is 6 GB. The minimum space for data is 5 GB, as partitions smaller than 5 GB are automatically assigned the weight of 0.

So although the minimum disk space for an OSD is 11 GB, we do not recommend a disk smaller than 20 GB, even for testing purposes.

2.4.3 Recommended Size for the BlueStore's WAL and DB Device

Tip: More Information
Refer to Section 1.4, “BlueStore” for more information on BlueStore.

We recommend reserving 4 GB for the WAL device. The recommended size for DB is 64 GB for most workloads.

If you intend to put the WAL and DB device on the same disk, then we recommend using a single partition for both devices, rather than having a separate partition for each. This allows Ceph to use the DB device for the WAL operation as well. Management of the disk space is therefore more effective as Ceph uses the DB partition for the WAL only if there is a need for it. Another advantage is that the probability that the WAL partition gets full is very small, and when it is not used fully then its space is not wasted but used for DB operation.
To share the DB device with the WAL, do not specify the WAL device, and specify only the DB device.
Find more information about specifying an OSD layout in Section 5.5.2, “DriveGroups”.
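A minimal DriveGroups sketch of such a layout, assuming rotational data disks and solid-state DB devices (the group name and filter values are assumptions; see Section 5.5.2, “DriveGroups” for the authoritative format):

drive_group_default:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0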

2.4.4 Using SSD for OSD Journals

Solid-state drives (SSD) have no moving parts. This reduces random access time and read latency while accelerating data throughput. Because their price per 1 MB is significantly higher than the price of spinning hard disks, SSDs are only suitable for smaller storage.

OSDs may see a significant performance improvement by storing their journal on an SSD and the object data on a separate hard disk.


Tip: Sharing an SSD for Multiple Journals
As journal data occupies relatively little space, you can mount several journal directories to a single SSD disk. Keep in mind that with each shared journal, the performance of the SSD disk degrades. We do not recommend sharing more than six journals on the same SSD disk and 12 on NVMe disks.

2.4.5 Maximum Recommended Number of Disks

You can have as many disks in one server as it allows. There are a few things to consider when planning the number of disks per server:

Network bandwidth. The more disks you have in a server, the more data must be transferred via the network card(s) for the disk write operations.

Memory. RAM above 2 GB is used for the BlueStore cache. With the default osd_memory_target of 4 GB, the system has a reasonable starting cache size for spinning media. If using SSD or NVMe, consider increasing the cache size and RAM allocation per OSD to maximize performance.

Fault tolerance. If the complete server fails, the more disks it has, the more OSDs the cluster temporarily loses. Moreover, to keep the replication rules running, you need to copy all the data from the failed server among the other nodes in the cluster.

2.5 Monitor Nodes

At least three Ceph Monitor nodes are required. The number of monitors should always be odd (1+2n).

4 GB of RAM.

Processor with four logical cores.

An SSD or other sufficiently fast storage type is highly recommended for monitors, specifically for the /var/lib/ceph path on each monitor node, as quorum may be unstable with high disk latencies. Two disks in RAID 1 configuration are recommended for redundancy.


It is recommended that separate disks or at least separate disk partitions are used for the monitor processes to protect the monitor's available disk space from things like log file creep.

There must only be one monitor process per node.

Mixing OSD, monitor, or Object Gateway nodes is only supported if sufficient hardware resources are available. That means that the requirements for all services need to be added up.

Two network interfaces bonded to multiple switches.

2.6 Object Gateway Nodes

Object Gateway nodes should have six to eight CPU cores and 32 GB of RAM (64 GB recommended). When other processes are co-located on the same machine, their requirements need to be added up.

2.7 Metadata Server Nodes

Proper sizing of the Metadata Server nodes depends on the specific use case. Generally, the more open files the Metadata Server is to handle, the more CPU and RAM it needs. The following are the minimum requirements:

3 GB of RAM for each Metadata Server daemon.

Bonded network interface.

2.5 GHz CPU with at least 2 cores.

2.8 Admin Node

At least 4 GB of RAM and a quad-core CPU are required. This includes running the Salt master on the Admin Node. For large clusters with hundreds of nodes, 6 GB of RAM is suggested.


2.9 iSCSI Nodes

iSCSI nodes should have six to eight CPU cores and 16 GB of RAM.

2.10 SUSE Enterprise Storage 6 and Other SUSE Products

This section contains important information about integrating SUSE Enterprise Storage 6 with other SUSE products.

2.10.1 SUSE Manager

SUSE Manager and SUSE Enterprise Storage are not integrated, therefore SUSE Manager cannot currently manage a SUSE Enterprise Storage cluster.

2.11 Naming Limitations

Ceph does not generally support non-ASCII characters in configuration files, pool names, user names and so forth. When configuring a Ceph cluster we recommend using only simple alphanumeric characters (A-Z, a-z, 0-9) and minimal punctuation ('.', '-', '_') in all Ceph object/configuration names.

2.12 OSD and Monitor Sharing One Server

Although it is technically possible to run Ceph OSDs and Monitors on the same server in test environments, we strongly recommend having a separate server for each monitor node in production. The main reason is performance: the more OSDs the cluster has, the more I/O operations the monitor nodes need to perform. And when one server is shared between a monitor node and OSD(s), the OSD I/O operations are a limiting factor for the monitor node.

Another consideration is whether to share disks between an OSD, a monitor node, and the operating system on the server. The answer is simple: if possible, dedicate a separate disk to OSD, and a separate server to a monitor node.


Although Ceph supports directory-based OSDs, an OSD should always have a dedicated disk other than the operating system one.

Tip
If it is really necessary to run OSD and monitor node on the same server, run the monitor on a separate disk by mounting the disk to the /var/lib/ceph/mon directory for slightly better performance.
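A minimal sketch of such a setup, assuming an illustrative device name and XFS as the file system (adapt both to your hardware):

root # mkfs.xfs /dev/sdX1
root # mount /dev/sdX1 /var/lib/ceph/mon

Add a corresponding /etc/fstab entry to make the mount persistent across reboots.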


3 Admin Node HA Setup

The Admin Node is a Ceph cluster node where the Salt master service runs. The Admin Node is a central point of the Ceph cluster because it manages the rest of the cluster nodes by querying and instructing their Salt minion services. It usually includes other services as well, for example the Grafana dashboard backed by the Prometheus monitoring toolkit.

In case of Admin Node failure, you usually need to provide new working hardware for the node and restore the complete cluster configuration stack from a recent backup. Such a method is time consuming and causes a cluster outage.

To prevent Ceph cluster downtime caused by an Admin Node failure, we recommend making use of a High Availability (HA) cluster for the Ceph Admin Node.

3.1 Outline of the HA Cluster for Admin Node

The idea of an HA cluster is that if one cluster node fails, the other node automatically takes over its role, including the virtualized Admin Node. This way, other Ceph cluster nodes do not notice that the Admin Node failed.

The minimal HA solution for the Admin Node requires the following hardware:

Two bare metal servers able to run SUSE Linux Enterprise with the High Availability extension and virtualize the Admin Node.

Two or more redundant network communication paths, for example via Network Device Bonding.

Shared storage to host the disk image(s) of the Admin Node virtual machine. The shared storage needs to be accessible from both servers. It can be, for example, an NFS export, a Samba share, or iSCSI target.

Find more details on the cluster requirements at https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-install-quick/#sec-ha-inst-quick-req .


FIGURE 3.1: 2-NODE HA CLUSTER FOR ADMIN NODE

3.2 Building a HA Cluster with Admin Node

The following procedure summarizes the most important steps of building the HA cluster for virtualizing the Admin Node. For details, refer to the indicated links.

1. Set up a basic 2-node HA cluster with shared storage as described in https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-install-quick/#art-sleha-install-quick .

2. On both cluster nodes, install all packages required for running the KVM hypervisor and the libvirt toolkit as described in https://documentation.suse.com/sles/15-SP1/single-html/SLES-virtualization/#sec-vt-installation-kvm .

3. On the first cluster node, create a new KVM virtual machine (VM) making use of libvirt as described in https://documentation.suse.com/sles/15-SP1/single-html/SLES-virtualization/#sec-libvirt-inst-virt-install . Use the preconfigured shared storage to store the disk images of the VM.

4. After the VM setup is complete, export its configuration to an XML file on the shared storage. Use the following syntax:

root # virsh dumpxml VM_NAME > /path/to/shared/vm_name.xml


5. Create a resource for the Admin Node VM. Refer to https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-guide/#cha-conf-hawk2 for general info on creating HA resources. Detailed info on creating resources for a KVM virtual machine is described in http://www.linux-ha.org/wiki/VirtualDomain_%28resource_agent%29 .
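For illustration only, a sketch of such a resource using the crm shell and the VirtualDomain resource agent (the resource name, monitor settings, and hypervisor URI are assumptions; follow the links above for the authoritative procedure):

root # crm configure primitive admin-node-vm ocf:heartbeat:VirtualDomain \
 params config="/path/to/shared/vm_name.xml" hypervisor="qemu:///system" \
 op monitor interval=30s timeout=60s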

6. On the newly-created VM guest, deploy the Admin Node including the additional services you need there. Follow the relevant steps in Section 5.3, “Cluster Deployment”. At the same time, deploy the remaining Ceph cluster nodes on the non-HA cluster servers.


4 User Privileges and Command Prompts

As a Ceph cluster administrator, you will be configuring and adjusting the cluster behavior by running specific commands. There are several types of commands you will need:

4.1 Salt/DeepSea Related Commands

These commands help you to deploy or upgrade the Ceph cluster, run commands on several (or all) cluster nodes at the same time, or assist you when adding or removing cluster nodes. The most frequently used are salt , salt-run , and deepsea . You need to run Salt commands on the Salt master node (refer to Section 5.2, “Introduction to DeepSea” for details) as root . These commands are introduced with the following prompt:

root@master #

For example:

root@master # salt '*.example.net' test.ping

4.2 Ceph Related Commands

These are lower level commands to configure and fine-tune all aspects of the cluster and its gateways on the command line, for example ceph , rbd , radosgw-admin , or crushtool .

To run Ceph related commands, you need to have read access to a Ceph key. The key's capabilities then define your privileges within the Ceph environment. One option is to run Ceph commands as root (or via sudo ) and use the unrestricted default keyring 'ceph.client.admin.keyring'.

A safer and recommended option is to create a more restrictive individual key for each administrator user and put it in a directory where the users can read it, for example:

~/.ceph/ceph.client.USERNAME.keyring

Tip: Path to Ceph Keys
To use a custom admin user and keyring, you need to specify the user name and path to the key each time you run the ceph command using the -n client.USER_NAME and --keyring PATH/TO/KEYRING options.


To avoid this, include these options in the CEPH_ARGS variable in the individual users' ~/.bashrc files.
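As an illustration only, the following sketch creates a restricted key for a hypothetical administrator user 'alice' and points CEPH_ARGS at it; the user name, capabilities, and paths are examples and need to be adapted to your environment:

cephadm@adm > ceph auth get-or-create client.alice \
 mon 'allow r' mgr 'allow r' osd 'allow rw' \
 -o /home/alice/.ceph/ceph.client.alice.keyring

# added to /home/alice/.bashrc
export CEPH_ARGS="-n client.alice --keyring /home/alice/.ceph/ceph.client.alice.keyring"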

Although you can run Ceph related commands on any cluster node, we recommend running them on the Admin Node. This documentation uses the cephadm user to run the commands; therefore, they are introduced with the following prompt:

cephadm@adm >

For example:

cephadm@adm > ceph auth list

Tip: Commands for Specific Nodes
If the documentation instructs you to run a command on a cluster node with a specific role, it will be addressed by the prompt. For example:

cephadm@mon >

4.3 General Linux Commands

Linux commands not related to Ceph or DeepSea, such as mount , cat , or openssl , are introduced either with the cephadm@adm > or root # prompts, depending on which privileges the related command requires.

4.4 Additional Information

For more information on Ceph key management, refer to Book “Administration Guide”, Chapter 19 “Authentication with cephx”, Section 19.2 “Key Management”.


II Cluster Deployment and Upgrade

5 Deploying with DeepSea/Salt 30

6 Upgrading from Previous Releases 64

7 Customizing the Default Configuration 98


5 Deploying with DeepSea/Salt

Salt along with DeepSea is a stack of components that help you deploy and manage server infrastructure. It is very scalable, fast, and relatively easy to get running. Read the following considerations before you start deploying the cluster with Salt:

Salt minions are the nodes controlled by a dedicated node called Salt master. Salt minions have roles, for example Ceph OSD, Ceph Monitor, Ceph Manager, Object Gateway, iSCSI Gateway, or NFS Ganesha.

A Salt master runs its own Salt minion. It is required for running privileged tasks—for example creating, authorizing, and copying keys to minions—so that remote minions never need to run privileged tasks.

Tip: Sharing Multiple Roles per Server
You will get the best performance from your Ceph cluster when each role is deployed on a separate node. But real deployments sometimes require sharing one node for multiple roles. To avoid trouble with performance and the upgrade procedure, do not deploy the Ceph OSD, Metadata Server, or Ceph Monitor role to the Admin Node.

Salt minions need to correctly resolve the Salt master's host name over the network. By default, they look for the salt host name, but you can specify any other network-reachable host name in the /etc/salt/minion file; see Section 5.3, “Cluster Deployment”.

5.1 Read the Release Notes

In the release notes you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the release notes to see whether:

your hardware needs special considerations.

any used software packages have changed significantly.

special precautions are necessary for your installation.

The release notes also provide information that could not make it into the manual on time. They also contain notes about known issues.


After having installed the package release-notes-ses , find the release notes locally in the directory /usr/share/doc/release-notes or online at https://www.suse.com/releasenotes/ .

5.2 Introduction to DeepSea

The goal of DeepSea is to save the administrator time and to perform complex operations on a Ceph cluster with confidence.

Ceph is a very configurable software solution. It increases both the freedom and responsibility of system administrators.

The minimal Ceph setup is good for demonstration purposes, but does not show the interesting features of Ceph that only become visible with a large number of nodes.

DeepSea collects and stores data about individual servers, such as addresses and device names. For a distributed storage system such as Ceph, there can be hundreds of such items to collect and store. Collecting the information and entering the data manually into a configuration management tool is exhausting and error prone.

The steps necessary to prepare the servers, collect the configuration, and configure and deploy Ceph are mostly the same. However, this does not address managing the separate functions. For day-to-day operations, the ability to trivially add hardware to a given function and remove it gracefully is a requirement.

DeepSea addresses these observations with the following strategy: DeepSea consolidates the administrator's decisions in a single file. The decisions include cluster assignment, role assignment, and profile assignment. DeepSea also collects each set of tasks into a simple goal. Each goal is a stage:

DEEPSEA STAGES DESCRIPTION

Stage 0—the preparation—during this stage, all required updates are applied and your system may be rebooted.

Important: Re-run Stage 0 after the Admin Node Reboot
If the Admin Node reboots during stage 0 to load the new kernel version, you need to run stage 0 again, otherwise minions will not be targeted.


Stage 1—the discovery—here all hardware in your cluster is detected and the information necessary for the Ceph configuration is collected. For details about configuration, refer to Section 5.5, “Configuration and Customization”.

Stage 2—the configuration—you need to prepare configuration data in a particular format.

Stage 3—the deployment—creates a basic Ceph cluster with mandatory Ceph services. See Section 1.2.3, “Ceph Nodes and Daemons” for their list.

Stage 4—the services—additional features of Ceph like iSCSI, Object Gateway, and CephFS can be installed in this stage. Each is optional.

Stage 5—the removal stage. This stage is not mandatory and during the initial setup it is usually not needed. In this stage the roles of minions and also the cluster configuration are removed. You need to run this stage when you need to remove a storage node from your cluster. For details refer to Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.3 “Removing and Reinstalling Cluster Nodes”.

5.2.1 Organization and Important Locations

Salt has several standard locations and several naming conventions used on your master node:

/srv/pillar

The directory stores configuration data for your cluster minions. Pillar is an interface for providing global configuration values to all your cluster minions.

/srv/salt/

The directory stores Salt state files (also called sls files). State files are formatted descriptions of states in which the cluster should be.

/srv/module/runners

The directory stores Python scripts known as runners. Runners are executed on the master node.

/srv/salt/_modules

The directory stores Python scripts that are called modules. The modules are applied to all minions in your cluster.

/srv/pillar/ceph


The directory is used by DeepSea. Collected configuration data are stored here.

/srv/salt/ceph

A directory used by DeepSea. It stores sls files that can be in different formats; each subdirectory contains only one type of sls file. For example, /srv/salt/ceph/stage contains orchestration files that are executed by salt-run state.orchestrate .
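For example, the orchestration stored in /srv/salt/ceph/stage/prep is referenced by the dotted name ceph.stage.prep and could be executed as shown below. This is only meant to illustrate the naming scheme; the deployment procedure later in this chapter tells you when to actually run the stages:

root@master # salt-run state.orchestrate ceph.stage.prep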

5.2.2 Targeting the Minions

DeepSea commands are executed via the Salt infrastructure. When using the salt command, you need to specify a set of Salt minions that the command will affect. We describe the set of minions as a target for the salt command. The following sections describe possible methods to target the minions.

5.2.2.1 Matching the Minion Name

You can target a minion or a group of minions by matching their names. A minion's name is usually the short host name of the node where the minion runs. This is a general Salt targeting method, not related to DeepSea. You can use globbing, regular expressions, or lists to limit the range of minion names. The general syntax follows:

root@master # salt target example.module

Tip: Ceph-only Cluster
If all Salt minions in your environment belong to your Ceph cluster, you can safely substitute target with '*' to include all registered minions.

Match all minions in the example.net domain (assuming the minion names are identical to their “full” host names):

root@master # salt '*.example.net' test.ping

Match the 'web1' to 'web5' minions:

root@master # salt 'web[1-5]' test.ping


Match both 'web1-prod' and 'web1-devel' minions using a regular expression:

root@master # salt -E 'web1-(prod|devel)' test.ping

Match a simple list of minions:

root@master # salt -L 'web1,web2,web3' test.ping

Match all minions in the cluster:

root@master # salt '*' test.ping

5.2.2.2 Targeting with a DeepSea Grain

In a heterogeneous Salt-managed environment where SUSE Enterprise Storage 6 is deployed on a subset of nodes alongside other cluster solutions, you need to mark the relevant minions by applying a 'deepsea' grain to them before running DeepSea stage 0. This way, you can easily target DeepSea minions in environments where matching by the minion name is problematic.

To apply the 'deepsea' grain to a group of minions, run:

root@master # salt target grains.append deepsea default

To remove the 'deepsea' grain from a group of minions, run:

root@master # salt target grains.delval deepsea destructive=True

After applying the 'deepsea' grain to the relevant minions, you can target them as follows:

root@master # salt -G 'deepsea:*' test.ping

The following command is an equivalent:

root@master # salt -C 'G@deepsea:*' test.ping
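If only a subset of the grain-tagged minions should be addressed, the grain can be combined with other criteria using Salt's compound matcher. The following is merely an illustration; the 'web*' host name glob is a placeholder:

root@master # salt -C 'G@deepsea:* and web*' test.ping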

5.2.2.3 Set the deepsea_minions Option

Setting the target in the deepsea_minions option is a requirement for DeepSea deployments. DeepSea uses it to instruct minions during the execution of stages (refer to DeepSea Stages Description for details).


To set or change the deepsea_minions option, edit the /srv/pillar/ceph/deepsea_minions.sls file on the Salt master and add or replace the following line:

deepsea_minions: target

Tip: deepsea_minions Target
As the target for the deepsea_minions option, you can use any targeting method: both Matching the Minion Name and Targeting with a DeepSea Grain.

Match all Salt minions in the cluster:

deepsea_minions: '*'

Match all minions with the 'deepsea' grain:

deepsea_minions: 'G@deepsea:*'

5.2.2.4 For More Information

You can use more advanced ways to target minions using the Salt infrastructure. The 'deepsea_minions' manual page gives you more details about DeepSea targeting ( man 7 deepsea_minions ).

5.3 Cluster Deployment

The cluster deployment process has several phases. First, you need to prepare all nodes of the cluster by configuring Salt, and then deploy and configure Ceph.

Tip: Deploying Monitor Nodes without Defining OSD Profiles
If you need to skip defining storage roles for OSD as described in Section 5.5.1.2, “Role Assignment” and deploy Ceph Monitor nodes first, you can do so by setting the DEV_ENV variable.

This allows deploying monitors without the presence of the role-storage/ directory, as well as deploying a Ceph cluster with at least one storage, monitor, and manager role.


To set the environment variable, either enable it globally by setting it in the /srv/pillar/ceph/stack/global.yml file, or set it for the current shell session only:

root@master # export DEV_ENV=true

As an example, /srv/pillar/ceph/stack/global.yml can be created with the following contents:

DEV_ENV: True

The following procedure describes the cluster preparation in detail.

1. Install and register SUSE Linux Enterprise Server 15 SP1 together with the SUSE Enterprise Storage 6 extension on each node of the cluster.

2. Verify that proper products are installed and registered by listing existing software repositories. Run zypper lr -E and compare the output with the following list:

SLE-Product-SLES15-SP1-Pool
SLE-Product-SLES15-SP1-Updates
SLE-Module-Server-Applications15-SP1-Pool
SLE-Module-Server-Applications15-SP1-Updates
SLE-Module-Basesystem15-SP1-Pool
SLE-Module-Basesystem15-SP1-Updates
SUSE-Enterprise-Storage-6-Pool
SUSE-Enterprise-Storage-6-Updates
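If you register from the command line instead of YaST, a possible sequence is sketched below. The registration codes are placeholders, and the product identifier assumes the x86_64 architecture:

root # SUSEConnect --regcode SLES_REGCODE
root # SUSEConnect --product ses/6/x86_64 --regcode SES_REGCODE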

3. Configure network settings including proper DNS name resolution on each node. The Salt master and all the Salt minions need to resolve each other by their host names. For more information on configuring a network, see https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#sec-network-yast . For more information on configuring a DNS server, see https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#cha-dns .

Important
If cluster nodes are configured for multiple networks, DeepSea will use the network to which their host names (or FQDNs) resolve. Consider the following example /etc/hosts :

192.168.100.1 ses1.example.com ses1

36 Cluster Deployment SES 6

Page 49: Deployment Guide - SUSE Enterprise Storage 610.1 iSCSI Block Storage 115 The Linux Kernel iSCSI Target 116 • iSCSI Initiators 116 10.2 General Information about ceph-iscsi 117 10.3

172.16.100.1 ses1clus.cluster.lan ses1clus

In the above example, the ses1 minion will resolve to the 192.168.100.x network and DeepSea will use this network as the public network. If the desired public network is 172.16.100.x , then the host name should be changed to ses1clus .

4. Install the salt-master and salt-minion packages on the Salt master node:

root@master # zypper in salt-master salt-minion

Check that the salt-master service is enabled and started, and enable and start it if needed:

root@master # systemctl enable salt-master.service
root@master # systemctl start salt-master.service

5. If you intend to use a firewall, verify that the Salt master node has ports 4505 and 4506 open to all Salt minion nodes. If the ports are closed, you can open them using the yast2 firewall command by allowing the SaltStack service.
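If you prefer the command line over YaST, one possible way to open the Salt ports with firewalld is sketched below; adapt it to your firewall setup and keep the warning that follows in mind:

root # firewall-cmd --permanent --add-port=4505-4506/tcp
root # firewall-cmd --reload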

Warning: DeepSea Stages Fail with Firewall
DeepSea deployment stages fail when the firewall is active (and even when it is configured). To pass the stages correctly, you need to either turn the firewall off by running

root # systemctl stop firewalld.service

or set the FAIL_ON_WARNING option to 'False' in /srv/pillar/ceph/stack/global.yml :

FAIL_ON_WARNING: False

6. Install the package salt-minion on all minion nodes.

root@minion > zypper in salt-minion

Make sure that the fully qualified domain name of each node can be resolved to the public network IP address by all other nodes.
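One quick way to verify this is to query each FQDN from every node, for example (the host name is taken from the earlier /etc/hosts example and is a placeholder):

root@minion > getent hosts ses1.example.com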


7. Configure all minions (including the master minion) to connect to the master. If your Salt master is not reachable by the host name salt , edit the file /etc/salt/minion or create a new file /etc/salt/minion.d/master.conf with the following content:

master: host_name_of_salt_master

If you performed any changes to the configuration files mentioned above, restart the Salt service on all Salt minions:

root@minion > systemctl restart salt-minion.service

8. Check that the salt-minion service is enabled and started on all nodes. Enable and start it if needed:

root # systemctl enable salt-minion.service
root # systemctl start salt-minion.service

9. Verify each Salt minion's fingerprint and accept all salt keys on the Salt master if the fingerprints match.

Note
If the Salt minion fingerprint comes back empty, make sure the Salt minion has a Salt master configuration and it can communicate with the Salt master.

View each minion's fingerprint:

root@master # salt-call --local key.finger
local:
3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...

After gathering fingerprints of all the Salt minions, list fingerprints of all unaccepted minion keys on the Salt master:

root@master # salt-key -F
[...]
Unaccepted Keys:
minion1:
3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...


If the minions' fingerprints match, accept them:

root@master # salt-key --accept-all

10. Verify that the keys have been accepted:

root@master # salt-key --list-all

11. By default, DeepSea uses the Admin Node as the time server for other cluster nodes. Therefore, if the Admin Node is not virtualized, select one or more time servers or pools, and synchronize the local time against them. Verify that the time synchronization service is enabled on each system start-up; a sample check is shown at the end of this step. Find more information on setting up time synchronization in https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-ntp.html#sec-ntp-yast . If the Admin Node is a virtual machine, provide better time sources for the cluster nodes by overriding the default NTP client configuration:

1. Edit /srv/pillar/ceph/stack/global.yml on the Salt master node and add the following line:

time_server: CUSTOM_NTP_SERVER

To add multiple time servers, the format is as follows:

time_server:
  - CUSTOM_NTP_SERVER1
  - CUSTOM_NTP_SERVER2
  - CUSTOM_NTP_SERVER3
[...]

2. Refresh the Salt pillar:

root@master # salt '*' saltutil.pillar_refresh

3. Verify the changed value:

root@master # salt '*' pillar.items

4. Apply the new setting:

root@master # salt '*' state.apply ceph.time
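To verify that time synchronization is enabled and working on a node, one possible check, assuming chrony (the default NTP implementation on SUSE Linux Enterprise Server 15 SP1), is:

root # systemctl status chronyd.service
root # chronyc sources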


12. Prior to deploying SUSE Enterprise Storage 6, manually zap all the disks. Remember to replace 'X' with the correct disk letter:

a. Stop all processes that are using the specific disk.

b. Verify whether any partition on the disk is mounted, and unmount if needed.

c. If the disk is managed by LVM, deactivate and delete the whole LVM infrastructure. Refer to https://www.suse.com/documentation/sles-15/book_storage/data/cha_lvm.html for more details.

d. If the disk is part of MD RAID, deactivate the RAID. Refer to https://documentation.suse.com/sles/15-SP1/single-html/SLES-storage/#part-software-raid for more details.

e. Tip: Rebooting the Server
If you get error messages such as 'partition in use' or 'kernel cannot be updated with the new partition table' during the following steps, reboot the server.

Wipe the beginning of each partition (as root ):

for partition in /dev/sdX[0-9]*
do
  dd if=/dev/zero of=$partition bs=4096 count=1 oflag=direct
done

f. Wipe the beginning of the drive:

root # dd if=/dev/zero of=/dev/sdX bs=512 count=34 oflag=direct

g. Wipe the end of the drive:

root # dd if=/dev/zero of=/dev/sdX bs=512 count=33 \
  seek=$((`blockdev --getsz /dev/sdX` - 33)) oflag=direct

h. Verify that the drive is empty (with no GPT structures) using:

root # parted -s /dev/sdX print free

or


root # dd if=/dev/sdX bs=512 count=34 | hexdump -C
root # dd if=/dev/sdX bs=512 count=33 \
  skip=$((`blockdev --getsz /dev/sdX` - 33)) | hexdump -C

13. Optionally, if you need to preconfigure the cluster's network settings before the deepsea package is installed, create /srv/pillar/ceph/stack/ceph/cluster.yml manually and set the cluster_network: and public_network: options. Note that the file will not be overwritten after you install deepsea . Then, run:

chown -R salt:salt /srv/pillar/ceph/stack
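For illustration, reusing the example networks from the /etc/hosts snippet earlier in this procedure, a manually created cluster.yml could look as follows (the subnets are examples only):

public_network: 192.168.100.0/24
cluster_network: 172.16.100.0/24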

Tip: Enabling IPv6
If you need to enable IPv6 network addressing, refer to Section 7.2.1, “Enabling IPv6 for Ceph Cluster Deployment”.

14. Install DeepSea on the Salt master node:

root@master # zypper in deepsea

15. The value of the master_minion parameter is dynamically derived from the /etc/salt/minion_id file on the Salt master. If you need to override the discovered value, edit the file /srv/pillar/ceph/stack/global.yml and set a relevant value:

master_minion: MASTER_MINION_NAME

If your Salt master is reachable via more host names, use the Salt minion name for the storage cluster as returned by the salt-key -L command. If you used the default host name for your Salt master—salt—in the ses domain, then the file looks as follows:

master_minion: salt.ses

Now you deploy and configure Ceph. Unless specified otherwise, all steps are mandatory.

Note: Salt Command Conventions
There are two possible ways to run salt-run state.orch—one is with 'stage. STAGE_NUMBER ', the other is with the name of the stage. Both notations have the same impact and it is fully your preference which command you use.


PROCEDURE 5.1: RUNNING DEPLOYMENT STAGES

1. Ensure the Salt minions belonging to the Ceph cluster are correctly targeted through the deepsea_minions option in /srv/pillar/ceph/deepsea_minions.sls . Refer to Section 5.2.2.3, “Set the deepsea_minions Option” for more information.

2. By default, DeepSea deploys Ceph clusters with tuned profiles active on Ceph Monitor, Ceph Manager, and Ceph OSD nodes. In some cases, you may need to deploy without tuned profiles. To do so, put the following lines in /srv/pillar/ceph/stack/global.yml before running DeepSea stages:

alternative_defaults:
  tuned_mgr_init: default-off
  tuned_mon_init: default-off
  tuned_osd_init: default-off

3. Optional: create Btrfs sub-volumes for /var/lib/ceph/ . This step needs to be executed before DeepSea stage 0. To migrate existing directories or for more details, see Book “Administration Guide”, Chapter 33 “Hints and Tips”, Section 33.6 “Btrfs Subvolume for /var/lib/ceph on Ceph Monitor Nodes”.
Apply the following commands to each of the Salt minions:

root@master # salt 'MONITOR_NODES' saltutil.sync_all
root@master # salt 'MONITOR_NODES' state.apply ceph.subvolume

Note
The ceph.subvolume command creates /var/lib/ceph as a @/var/lib/ceph Btrfs subvolume.

The new subvolume is now mounted and /etc/fstab is updated.

4. Prepare your cluster. Refer to DeepSea Stages Description for more details.

root@master # salt-run state.orch ceph.stage.0

or

root@master # salt-run state.orch ceph.stage.prep


Note: Run or Monitor Stages using DeepSea CLI
Using the DeepSea CLI, you can follow the stage execution progress in real-time, either by running the DeepSea CLI in the monitoring mode, or by running the stage directly through DeepSea CLI. For details refer to Section 5.4, “DeepSea CLI”.

5. The discovery stage collects data from all minions and creates configuration fragments that are stored in the directory /srv/pillar/ceph/proposals . The data are stored in the YAML format in *.sls or *.yml files.
Run the following command to trigger the discovery stage:

root@master # salt-run state.orch ceph.stage.1

or

root@master # salt-run state.orch ceph.stage.discovery

6. After the previous command finishes successfully, create a policy.cfg file in /srv/pillar/ceph/proposals . For details refer to Section 5.5.1, “The policy.cfg File”.

Tip
If you need to change the cluster's network setting, edit /srv/pillar/ceph/stack/ceph/cluster.yml and adjust the lines starting with cluster_network: and public_network: .

7. The configuration stage parses the policy.cfg file and merges the included files into their final form. Cluster and role related content are placed in /srv/pillar/ceph/cluster , while Ceph specific content is placed in /srv/pillar/ceph/stack/default .
Run the following command to trigger the configuration stage:

root@master # salt-run state.orch ceph.stage.2

or

root@master # salt-run state.orch ceph.stage.configure


The configuration step may take several seconds. After the command finishes, you can view the pillar data for the specified minions (for example, named ceph_minion1 , ceph_minion2 , etc.) by running:

root@master # salt 'ceph_minion*' pillar.items

Tip: Modifying OSD's Layout
If you want to modify the default OSD's layout and change the drive groups configuration, follow the procedure described in Section 5.5.2, “DriveGroups”.

Note: Overwriting Defaults
As soon as the command finishes, you can view the default configuration and change it to suit your needs. For details refer to Chapter 7, Customizing the Default Configuration.

8. Now you run the deployment stage. In this stage, the pillar is validated, and the Ceph Monitor and Ceph OSD daemons are started:

root@master # salt-run state.orch ceph.stage.3

or

root@master # salt-run state.orch ceph.stage.deploy

The command may take several minutes. If it fails, you need to fix the issue and run the previous stages again. After the command succeeds, run the following to check the status:

cephadm@adm > ceph -s

9. The last step of the Ceph cluster deployment is the services stage. Here you instantiate any of the currently supported services: iSCSI Gateway, CephFS, Object Gateway, and NFS Ganesha. In this stage, the necessary pools and authorization keyrings are created, and the services are started. To start the stage, run the following:

root@master # salt-run state.orch ceph.stage.4

or


root@master # salt-run state.orch ceph.stage.services

Depending on the setup, the command may run for several minutes.

10. Before you continue, we strongly recommend enabling the Ceph telemetry module. For information and instructions, see Book “Administration Guide”, Chapter 21 “Ceph Manager Modules”, Section 21.2 “Telemetry Module”.
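For reference, enabling telemetry typically boils down to commands similar to the following sketch; follow the referenced chapter for the authoritative procedure:

cephadm@adm > ceph mgr module enable telemetry
cephadm@adm > ceph telemetry on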

5.4 DeepSea CLI

DeepSea also provides a command line interface (CLI) tool that allows the user to monitor or run stages while visualizing the execution progress in real-time. Verify that the deepsea-cli package is installed before you run the deepsea executable.

Two modes are supported for visualizing a stage's execution progress:

DEEPSEA CLI MODES

Monitoring mode: visualizes the execution progress of a DeepSea stage triggered by the salt-run command issued in another terminal session.

Stand-alone mode: runs a DeepSea stage while providing real-time visualization of its component steps as they are executed.

Important: DeepSea CLI Commands
The DeepSea CLI commands can only be run on the Salt master node with root privileges.

5.4.1 DeepSea CLI: Monitor Mode

The progress monitor provides a detailed, real-time visualization of what is happening during execution of stages using salt-run state.orch commands in other terminal sessions.

Tip: Start Monitor in a New Terminal Session
You need to start the monitor in a new terminal window before running any salt-run state.orch so that the monitor can detect the start of the stage's execution.


If you start the monitor after issuing the salt-run state.orch command, then no execution progress will be shown.

You can start the monitor mode by running the following command:

root@master # deepsea monitor

For more information about the available command line options of the deepsea monitor command, check its manual page:

root@master # man deepsea-monitor

5.4.2 DeepSea CLI: Stand-alone Mode

In the stand-alone mode, DeepSea CLI can be used to run a DeepSea stage, showing its execution in real-time.

The command to run a DeepSea stage from the DeepSea CLI has the following form:

root@master # deepsea stage run stage-name

where stage-name corresponds to the way Salt orchestration state files are referenced. For example, stage deploy, which corresponds to the directory located in /srv/salt/ceph/stage/deploy , is referenced as ceph.stage.deploy.

This command is an alternative to the Salt-based commands for running DeepSea stages (or any DeepSea orchestration state file).

The command deepsea stage run ceph.stage.0 is equivalent to salt-run state.orch ceph.stage.0 .

For more information about the available command line options accepted by the deepsea stage run command, check its manual page:

root@master # man deepsea-stage run


The following figure shows an example of the output of the DeepSea CLI when running Stage 2:

FIGURE 5.1: DEEPSEA CLI STAGE EXECUTION PROGRESS OUTPUT

5.4.2.1 DeepSea CLI stage run Alias

For advanced users of Salt, we also support an alias for running a DeepSea stage that takes the Salt command used to run a stage, for example, salt-run state.orch stage-name , as a command of the DeepSea CLI.

Example:

root@master # deepsea salt-run state.orch stage-name


5.5 Configuration and Customization

5.5.1 The policy.cfg File

The /srv/pillar/ceph/proposals/policy.cfg configuration file is used to determine roles of individual cluster nodes. For example, which nodes act as Ceph OSDs or Ceph Monitors. Edit policy.cfg in order to reflect your desired cluster setup. The order of the sections is arbitrary, but the content of included lines overwrites matching keys from the content of previous lines.

Tip: Examples of policy.cfg
You can find several examples of complete policy files in the /usr/share/doc/packages/deepsea/examples/ directory.

5.5.1.1 Cluster Assignment

In the cluster section you select minions for your cluster. You can select all minions, or you can blacklist or whitelist minions. Examples for a cluster called ceph follow.

To include all minions, add the following lines:

cluster-ceph/cluster/*.sls

To whitelist a particular minion:

cluster-ceph/cluster/abc.domain.sls

or a group of minions—you can use shell glob matching:

cluster-ceph/cluster/mon*.sls

To blacklist minions, set them to unassigned :

cluster-unassigned/cluster/client*.sls


5.5.1.2 Role Assignment

This section provides you with details on assigning 'roles' to your cluster nodes. A 'role' in this context means the service you need to run on the node, such as Ceph Monitor, Object Gateway, or iSCSI Gateway. No role is assigned automatically; only roles added to policy.cfg will be deployed.

The assignment follows this pattern:

role-ROLE_NAME/PATH/FILES_TO_INCLUDE

Where the items have the following meaning and values:

ROLE_NAME is any of the following: 'master', 'admin', 'mon', 'mgr', 'storage', 'mds', 'igw', 'rgw', 'ganesha', 'grafana', or 'prometheus'.

PATH is a relative directory path to .sls or .yml files. In case of .sls files, it usually is cluster , while .yml files are located at stack/default/ceph/minions .

FILES_TO_INCLUDE are the Salt state files or YAML configuration files. They normally consist of Salt minions' host names, for example ses5min2.yml . Shell globbing can be used for more specific matching.

An example for each role follows:

master - the node has admin keyrings to all Ceph clusters. Currently, only a single Ceph cluster is supported. As the master role is mandatory, always add a similar line to the following:

role-master/cluster/master*.sls

admin - the minion will have an admin keyring. You define the role as follows:

role-admin/cluster/abc*.sls

mon - the minion will provide the monitor service to the Ceph cluster. This role requires addresses of the assigned minions. From SUSE Enterprise Storage 5, the public addresses are calculated dynamically and are no longer needed in the Salt pillar.

role-mon/cluster/mon*.sls

The example assigns the monitor role to a group of minions.


mgr - the Ceph manager daemon which collects all the state information from the whole cluster. Deploy it on all minions where you plan to deploy the Ceph monitor role.

role-mgr/cluster/mgr*.sls

storage - use this role to specify storage nodes.

role-storage/cluster/data*.sls

mds - the minion will provide the metadata service to support CephFS.

role-mds/cluster/mds*.sls

igw - the minion will act as an iSCSI Gateway. This role requires addresses of the assigned minions, thus you need to also include the files from the stack directory:

role-igw/cluster/*.sls

rgw - the minion will act as an Object Gateway:

role-rgw/cluster/rgw*.sls

ganesha - the minion will act as an NFS Ganesha server. The 'ganesha' role requires either an 'rgw' or 'mds' role in the cluster, otherwise the validation will fail in Stage 3.

role-ganesha/cluster/ganesha*.sls

To successfully install NFS Ganesha, additional configuration is required. If you want to use NFS Ganesha, read Chapter 12, Installation of NFS Ganesha before executing stages 2 and 4. However, it is possible to install NFS Ganesha later.
In some cases it can be useful to define custom roles for NFS Ganesha nodes. For details, see Book “Administration Guide”, Chapter 30 “NFS Ganesha: Export Ceph Data via NFS”, Section 30.3 “Custom NFS Ganesha Roles”.

grafana, prometheus - this node adds Grafana charts based on Prometheus alerting to the Ceph Dashboard. Refer to Book “Administration Guide” for its detailed description.

role-grafana/cluster/grafana*.sls

role-prometheus/cluster/prometheus*.sls


Note: Multiple Roles of Cluster Nodes
You can assign several roles to a single node. For example, you can assign the 'mds' role to the monitor nodes:

role-mds/cluster/mon[1,2]*.sls

5.5.1.3 Common Configuration

The common configuration section includes configuration files generated during the discovery (Stage 1). These configuration files store parameters like fsid or public_network . To include the required Ceph common configuration, add the following lines:

config/stack/default/global.yml
config/stack/default/ceph/cluster.yml

5.5.1.4 Item Filtering

Sometimes it is not practical to include all files from a given directory with *.sls globbing. The policy.cfg file parser understands the following filters:

Warning: Advanced Techniques
This section describes filtering techniques for advanced users. When not used correctly, filtering can cause problems, for example in case your node numbering changes.

slice=[start:end]

Use the slice filter to include only items start through end-1. Note that items in the given directory are sorted alphanumerically. The following line includes the third to fifth files from the role-mon/cluster/ subdirectory:

role-mon/cluster/*.sls slice[3:6]

re=regexp

Use the regular expression filter to include only items matching the given expressions. For example:

role-mon/cluster/mon*.sls re=.*1[135]\.subdomainX\.sls$


5.5.1.5 Example policy.cfg File

Following is an example of a basic policy.cfg file:

## Cluster Assignment
cluster-ceph/cluster/*.sls 1

## Roles
# ADMIN
role-master/cluster/examplesesadmin.sls 2
role-admin/cluster/sesclient*.sls 3

# MON
role-mon/cluster/ses-example-[123].sls 4

# MGR
role-mgr/cluster/ses-example-[123].sls 5

# STORAGE
role-storage/cluster/ses-example-[5678].sls 6

# MDS
role-mds/cluster/ses-example-4.sls 7

# IGW
role-igw/cluster/ses-example-4.sls 8

# RGW
role-rgw/cluster/ses-example-4.sls 9

# COMMON
config/stack/default/global.yml 10
config/stack/default/ceph/cluster.yml 11

1 Indicates that all minions are included in the Ceph cluster. If you have minions you do not want to include in the Ceph cluster, use:

cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/ses-example-*.sls

The first line marks all minions as unassigned. The second line overrides minions matching 'ses-example-*.sls', and assigns them to the Ceph cluster.

2 The minion called 'examplesesadmin' has the 'master' role. This, by the way, means it will get admin keys to the cluster.

3 All minions matching 'sesclient*' will get admin keys as well.


4 All minions matching 'ses-example-[123]' (presumably three minions: ses-example-1, ses-example-2, and ses-example-3) will be set up as MON nodes.

5 All minions matching 'ses-example-[123]' (all MON nodes in the example) will be set up as MGR nodes.

6 All minions matching 'ses-example-[5678]' will be set up as storage nodes.

7 Minion 'ses-example-4' will have the MDS role.

8 Minion 'ses-example-4' will have the IGW role.

9 Minion 'ses-example-4' will have the RGW role.

10 Means that we accept the default values for common configuration parameters such as fsid and public_network .

11 Means that we accept the default values for common configuration parameters such as fsid and public_network .

5.5.2 DriveGroups

DriveGroups specify the layouts of OSDs in the Ceph cluster. They are defined in a single file /srv/salt/ceph/configuration/files/drive_groups.yml .

An administrator should manually specify a group of OSDs that are interrelated (hybrid OSDs that are deployed on solid state and spinners) or share the same deployment options (identical, for example same object store, same encryption option, stand-alone OSDs). To avoid explicitly listing devices, DriveGroups use a list of filter items that correspond to a few selected fields of ceph-volume 's inventory reports. In the simplest case this could be the 'rotational' flag (all solid-state drives are to be db_devices, all rotating ones data devices) or something more involved such as 'model' strings, or sizes. DeepSea will provide code that translates these DriveGroups into actual device lists for inspection by the user.

Note
Note that the filters use an OR gate to match against the drives.


Following is a simple procedure that demonstrates the basic workflow when configuring DriveGroups:

1. Inspect your disks' properties as seen by the ceph-volume command. Only these properties are accepted by DriveGroups:

root@master # salt-run disks.details

2. Open the /srv/salt/ceph/configuration/files/drive_groups.yml YAML file and adjust it to your needs. Refer to Section 5.5.2.1, “Specification”. Remember to use spaces instead of tabs. Find more advanced examples in Section 5.5.2.4, “Examples”. The following example includes all drives available to Ceph as OSDs:

default_drive_group_name:
  target: '*'
  data_devices:
    all: true

3. Verify new layouts:

root@master # salt-run disks.list

This runner returns you a structure of matching disks based on your DriveGroups. If you are not happy with the result, repeat the previous step.

Tip: Detailed Report
In addition to the disks.list runner, there is a disks.report runner that prints out a detailed report of what will happen in the next DeepSea stage 3 invocation.

root@master # salt-run disks.report

4. Deploy OSDs. On the next DeepSea stage 3 invocation, the OSD disks will be deployed according to your DriveGroups specification.


5.5.2.1 Specification

/srv/salt/ceph/configuration/files/drive_groups.yml can take one of two basic forms, depending on whether BlueStore or FileStore is to be used. For BlueStore setups, drive_groups.yml can be as follows:

drive_group_default_name:
  target: *
  data_devices:
    drive_spec: DEVICE_SPECIFICATION
  db_devices:
    drive_spec: DEVICE_SPECIFICATION
  wal_devices:
    drive_spec: DEVICE_SPECIFICATION
  block_wal_size: '5G'  # (optional, unit suffixes permitted)
  block_db_size: '5G'   # (optional, unit suffixes permitted)
  osds_per_device: 1    # number of osd daemons per device
  format:               # 'bluestore' or 'filestore' (defaults to 'bluestore')
  encryption:           # 'True' or 'False' (defaults to 'False')

For FileStore setups, drive_groups.yml can be as follows:

drive_group_default_name:
  target: *
  data_devices:
    drive_spec: DEVICE_SPECIFICATION
  journal_devices:
    drive_spec: DEVICE_SPECIFICATION
  format: filestore
  encryption: True

Note
If you are unsure if your OSD is encrypted, see Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.5 “Verify an Encrypted OSD”.


5.5.2.2 Matching Disk Devices

You can describe the specification using the following filters:

By a disk model:

model: DISK_MODEL_STRING

By a disk vendor:

vendor: DISK_VENDOR_STRING

Tip: Lowercase Vendor String
Always lowercase the DISK_VENDOR_STRING .

Whether a disk is rotational or not. SSDs and NVME drives are not rotational.

rotational: 0

Deploy a node using all available drives for OSDs:

data_devices:
  all: true

Additionally, by limiting the number of matching disks:

limit: 10

5.5.2.3 Filtering Devices by Size

You can filter disk devices by their size—either by an exact size, or a size range. The size: parameter accepts arguments in the following form:

'10G' - Includes disks of an exact size.

'10G:40G' - Includes disks whose size is within the range.

':10G' - Includes disks less than or equal to 10 GB in size.

'40G:' - Includes disks equal to or greater than 40 GB in size.


EXAMPLE 5.1: MATCHING BY DISK SIZE

drive_group_default:
  target: '*'
  data_devices:
    size: '40TB:'
  db_devices:
    size: ':2TB'

Note: Quotes Required
When using the ':' delimiter, you need to enclose the size in quotes, otherwise the ':' sign will be interpreted as a new configuration hash.

Tip: Unit Shortcuts
Instead of (G)igabytes, you can specify the sizes in (M)egabytes or (T)erabytes as well.

5.5.2.4 Examples

This section includes examples of different OSD setups.

EXAMPLE 5.2: SIMPLE SETUP

This example describes two nodes with the same setup:

20 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

2 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

The corresponding drive_groups.yml file will be as follows:

drive_group_default:


  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: MC-55-44-XZ

Such a configuration is simple and valid. The problem is that an administrator may add disks from different vendors in the future, and these will not be included. You can improve it by reducing the filters on core properties of the drives:

drive_group_default:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

In the previous example, all rotating devices are declared as 'data devices' and all non-rotating devices are used as 'shared devices' (wal, db).

If you know that drives with more than 2 TB will always be the slower data devices, you can filter by size:

drive_group_default:
  target: '*'
  data_devices:
    size: '2TB:'
  db_devices:
    size: ':2TB'

EXAMPLE 5.3: ADVANCED SETUP

This example describes two distinct setups: 20 HDDs should share 2 SSDs, while 10 SSDs should share 2 NVMes.

20 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

12 SSDs


Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

2 NVMes

Vendor: Samsung

Model: NVME-QQQQ-987

Size: 256 GB

Such a setup can be defined with two layouts as follows:

drive_group:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: MC-55-44-XZ

drive_group_default:
  target: '*'
  data_devices:
    model: MC-55-44-XZ
  db_devices:
    vendor: samsung
    size: 256GB

Note that any drive of the size 256 GB and any drive from Samsung will match as a DB device with this example.

EXAMPLE 5.4: ADVANCED SETUP WITH NON-UNIFORM NODES

The previous examples assumed that all nodes have the same drives. However, that is not always the case:

Nodes 1-5:

20 HDDs


Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

2 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

Nodes 6-10:

5 NVMes

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

20 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

You can use the 'target' key in the layout to target specific nodes. Salt target notation helps to keep things simple:

drive_group_node_one_to_five:
  target: 'node[1-5]'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

followed by

drive_group_the_rest:
  target: 'node[6-10]'
  data_devices:


    model: MC-55-44-XZ
  db_devices:
    model: SSD-123-foo

EXAMPLE 5.5: EXPERT SETUP

All previous cases assumed that the WALs and DBs use the same device. It is however possible to deploy the WAL on a dedicated device as well:

20 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

2 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

2 NVMes

Vendor: Samsung

Model: NVME-QQQQ-987

Size: 256 GB

drive_group_default:
  target: '*'
  data_devices:
    model: MC-55-44-XZ
  db_devices:
    model: SSD-123-foo
  wal_devices:
    model: NVME-QQQQ-987

EXAMPLE 5.6: COMPLEX (AND UNLIKELY) SETUP

In the following setup, we are trying to define:

20 HDDs backed by 1 NVMe

2 HDDs backed by 1 SSD(db) and 1 NVMe(wal)


8 SSDs backed by 1 NVMe

2 SSDs stand-alone (encrypted)

1 HDD is spare and should not be deployed

The summary of used drives follows:

23 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

10 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

1 NVMe

Vendor: Samsung

Model: NVME-QQQQ-987

Size: 256 GB

The DriveGroups definition will be the following:

drive_group_hdd_nvme:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: NVME-QQQQ-987

drive_group_hdd_ssd_nvme:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: MC-55-44-XZ


  wal_devices:
    model: NVME-QQQQ-987

drive_group_ssd_nvme:
  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: NVME-QQQQ-987

drive_group_ssd_standalone_encrypted:
  target: '*'
  data_devices:
    model: SSD-123-foo
  encryption: True

One HDD will remain as the file is being parsed from top to bottom.

5.5.3 Adjusting ceph.conf with Custom Settings

If you need to put custom settings into the ceph.conf configuration file, see Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.14 “Adjusting ceph.conf with Custom Settings” for more details.


6 Upgrading from Previous Releases

This chapter introduces steps to upgrade SUSE Enterprise Storage 5.5 to version 6. Note that version 5.5 is basically 5 with all latest patches applied.

Note: Upgrade from Older Releases Not Supported
Upgrading from SUSE Enterprise Storage versions older than 5.5 is not supported. You first need to upgrade to the latest version of SUSE Enterprise Storage 5.5 and then follow the steps in this chapter.

6.1 General Considerations

If openATTIC is located on the Admin Node, it will be unavailable after you upgrade the node. The new Ceph Dashboard will not be available until you deploy it by using DeepSea.

The cluster upgrade may take a long time—approximately the time it takes to upgrade one machine multiplied by the number of cluster nodes.

A single node cannot be upgraded while running the previous SUSE Linux Enterprise Server release, but needs to be rebooted into the new version's installer. Therefore the services that the node provides will be unavailable for some time. The core cluster services will still be available—for example if one MON is down during upgrade, there are still at least two active MONs. Unfortunately, single instance services, such as a single iSCSI Gateway, will be unavailable.


6.2 Steps to Take before Upgrading the First Node

6.2.1 Read the Release Notes

In the SES 6 release notes, you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the SES 6 release notes online to see whether:

Your hardware needs special considerations.

Any used software packages have changed significantly.

Special precautions are necessary for your installation.

You can find SES 6 release notes online at https://www.suse.com/releasenotes/ .

6.2.2 Verify Your Password

Your password must be changed to meet SUSE Enterprise Storage 6 requirements. Ensure you change the username and password on all initiators as well. For more information on changing your password, see Section 10.4.4.3, “CHAP Authentication”.

6.2.3 Verify the Previous Upgrade

In case you previously upgraded from version 4, verify that the upgrade to version 5 was completed successfully:

Check for the existence of the file

/srv/salt/ceph/configuration/files/ceph.conf.import

It is created by the import process during the upgrade from SES 4 to 5. Also, the configuration_init: default-import option is set in the file /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml .

If configuration_init is still set to default-import , the cluster is using ceph.conf.import as its configuration file and not DeepSea's default ceph.conf , which is compiled from files in /srv/salt/ceph/configuration/files/ceph.conf.d/ .
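A quick way to check both conditions on the Salt master might be:

root@master # ls -l /srv/salt/ceph/configuration/files/ceph.conf.import
root@master # grep configuration_init /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml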


Therefore you need to inspect ceph.conf.import for any custom configuration, and possibly move the configuration to one of the files in

/srv/salt/ceph/configuration/files/ceph.conf.d/

Then remove the configuration_init: default-import line from /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml .

Warning: Default DeepSea Configuration
If you do not merge the configuration from ceph.conf.import and remove the configuration_init: default-import option, any default configuration settings we ship as part of DeepSea (stored in /srv/salt/ceph/configuration/files/ceph.conf.j2 ) will not be applied to the cluster.

Run the salt-run upgrade.check command to verify that the cluster uses the new bucket type straw2 , and that the Admin Node is not a storage node. The default is straw2 for any newly created buckets.

Important
The new straw2 bucket type fixes several limitations in the original straw bucket type. The previous straw buckets would change some mappings that should not have changed when a weight was adjusted. straw2 achieves the original goal of only changing mappings to or from the bucket item whose weight has changed.

Changing a bucket type from straw to straw2 results in a small amount of data movement, depending on how much the bucket item weights vary from each other. When the weights are all the same, no data will move. When an item's weight varies significantly, there will be more movement. To migrate, execute:

cephadm@adm > ceph osd getcrushmap -o backup-crushmap
cephadm@adm > ceph osd crush set-all-straw-buckets-to-straw2

If there are problems, you can revert this change with:

cephadm@adm > ceph osd setcrushmap -i backup-crushmap


Moving to straw2 buckets unlocks a few recent features, such as the crush-compat balancer mode that was added in Ceph Luminous (SES 5).

Check that the Ceph 'jewel' profile is used:

cephadm@adm > ceph osd crush dump | grep profile

6.2.4 Upgrade Old RBD Kernel Clients

If old RBD kernel clients (older than SUSE Linux Enterprise Server 12 SP3) are being used, refer to Book “Administration Guide”, Chapter 23 “RADOS Block Device”, Section 23.9 “Mapping RBD Using Old Kernel Clients”. We recommend upgrading old RBD kernel clients if possible.

6.2.5 Adjust AppArmor

If you used AppArmor in either 'complain' or 'enforce' mode, you need to set a Salt pillar variable before upgrading. Because SUSE Linux Enterprise Server 15 SP1 ships with AppArmor by default, AppArmor management was integrated into DeepSea stage 0. The default behavior in SUSE Enterprise Storage 6 is to remove AppArmor and related profiles. If you want to retain the behavior configured in SUSE Enterprise Storage 5.5, verify that one of the following lines is present in the /srv/pillar/ceph/stack/global.yml file before starting the upgrade:

apparmor_init: default-enforce

or

apparmor_init: default-complain

6.2.6 Verify MDS Names

From SUSE Enterprise Storage 6, MDS names are no longer allowed to begin with a digit, and such names will cause MDS daemons to refuse to start. You can check whether your daemons have such names either by running the ceph fs status command, or by restarting an MDS and checking its logs for the following message:

deprecation warning: MDS id '1mon1' is invalid and will be forbidden in a future version. MDS names may not start with a numeric digit.

If you see the above message, the MDS names must be migrated before attempting to upgrade to SUSE Enterprise Storage 6. DeepSea provides an orchestration to automate such a migration. MDS names starting with a digit will be prepended with 'mds.':

root@master # salt-run state.orch ceph.mds.migrate-numerical-names

Tip: Custom Configuration Bound to MDS Names
If you have configuration settings that are bound to MDS names and your MDS daemons have names starting with a digit, verify that your configuration settings apply to the new names as well (with the 'mds.' prefix). Consider the following example section in the /etc/ceph/ceph.conf file:

[mds.123-my-mds] # config section for an MDS with a name starting with a digit
mds cache memory limit = 1073741824
mds standby for name = 456-another-mds

The ceph.mds.migrate-numerical-names orchestrator will change the MDS daemon name '123-my-mds' to 'mds.123-my-mds'. You need to adjust the configuration to reflect the new name:

[mds.mds.123-my-mds] # config section adjusted to the new MDS name
mds cache memory limit = 1073741824
mds standby for name = mds.456-another-mds

This will add MDS daemons with the new names before removing the old MDS daemons. The number of MDS daemons will double for a short time. Clients will be able to access CephFS only after a short pause for failover to happen. Therefore, plan the migration for a time when you expect little or no CephFS load.

6.2.7 Consolidate Scrub-related Configuration

The osd_scrub_max_interval and osd_deep_scrub_interval settings are used by both OSD and MON daemons. OSDs use these settings to decide when to run scrubs, and MONs use them to decide whether a warning about scrubs not running in time should be shown.


Therefore, if non-default settings are used, they should be visible to both OSD and MON daemons (that is, defined either in both the [osd] and [mon] sections, or in the [global] section); otherwise the monitor may give a false alarm.

In SES 5.5 the monitor warnings are disabled by default, so the issue may go unnoticed if the settings are overridden in the [osd] section only. When the monitors are upgraded to SES 6, however, they will start to complain, because in this version the warnings are enabled by default. So if you define non-default scrub settings only in the [osd] section of your configuration, move them to the [global] section before upgrading to SES 6 to avoid false alarms about scrubs not running in time.
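For example, a minimal ceph.conf sketch: a scrub interval that was previously kept only under [osd] is placed in [global] so that the monitors see the same value (the interval shown is only a placeholder, not a recommendation):

[global]
# previously defined only in the [osd] section
osd scrub max interval = 1209600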

6.2.8 Back Up Cluster Data

Although creating backups of a cluster's configuration and data is not mandatory, we strongly recommend backing up important configuration files and cluster data. Refer to Book “Administration Guide”, Chapter 3 “Backing Up Cluster Configuration and Data” for more details.

6.2.9 Migrate from ntpd to chronyd

SUSE Linux Enterprise Server 15 SP1 no longer uses ntpd to synchronize the local host time. Instead, chronyd is used. You need to migrate the time synchronization daemon on each cluster node. You can either migrate to chronyd before upgrading the cluster, or upgrade the cluster and migrate to chronyd afterward.

Warning
Before you continue, review your current ntpd settings and determine whether you want to keep using the same time server. Keep in mind that the default behavior is to convert to chronyd .

If you want to manually maintain the chronyd configuration, follow the instructions below and ensure you disable ntpd time configuration. See Procedure 7.1, “Disabling Time Synchronization” for more information.


PROCEDURE 6.1: MIGRATE TO chronyd BEFORE THE CLUSTER UPGRADE

1. Install the chrony package:

root@minion > zypper install chrony

2. Edit the chronyd configuration file /etc/chrony.conf and add NTP sources from the current ntpd configuration in /etc/ntp.conf .

Tip: More Details on chronyd Configuration
Refer to https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-ntp.html to find more details about how to include time sources in the chronyd configuration.
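For example, a server entry from the old ntpd configuration can usually be carried over verbatim, because chronyd accepts the same server directive (ntp1.example.com is a placeholder host name):

# /etc/ntp.conf (ntpd)
server ntp1.example.com iburst

# /etc/chrony.conf (chronyd)
server ntp1.example.com iburst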

3. Disable and stop the ntpd service:

root@minion > systemctl disable ntpd.service && systemctl stop ntpd.service

4. Start and enable the chronyd service:

root@minion > systemctl start chronyd.service && systemctl enable chronyd.service

5. Verify the status of chronyd :

root@minion > chronyc tracking

PROCEDURE 6.2: MIGRATE TO chronyd AFTER THE CLUSTER UPGRADE

1. During cluster upgrade, add the following software repositories:

SLE-Module-Legacy15-SP1-Pool

SLE-Module-Legacy15-SP1-Updates

2. Upgrade the cluster to version 6.

3. Edit the chronyd configuration file /etc/chrony.conf and add NTP sources from the current ntpd configuration in /etc/ntp.conf .

Tip: More Details on chronyd Configuration
Refer to https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-ntp.html to find more details about how to include time sources in the chronyd configuration.


4. Disable and stop the ntpd service:

root@minion > systemctl disable ntpd.service && systemctl stop ntpd.service

5. Start and enable the chronyd service:

root@minion > systemctl start chronyd.service && systemctl enable chronyd.service

6. Migrate from ntpd to chronyd .

7. Verify the status of chronyd :

root@minion > chronyc tracking

8. Remove the legacy software repositories that you added to keep ntpd in the system during the upgrade process.

6.2.10 Patch Cluster Prior to Upgrade

Apply the latest patches to all cluster nodes prior to upgrade.

6.2.10.1 Required Software Repositories

Check that required repositories are configured on each host of the cluster. To list all available repositories, run:

root@minion > zypper lr

Important: Remove SUSE Enterprise Storage 5.5 LTSS Repositories
Upgrades will fail if LTSS repositories are configured in SUSE Enterprise Storage 5.5. Find their IDs and remove them from the system. For example:

root # zypper lr
[...]
12 | SUSE_Linux_Enterprise_Server_LTSS_12_SP3_x86_64:SLES12-SP3-LTSS-Debuginfo-Updates
13 | SUSE_Linux_Enterprise_Server_LTSS_12_SP3_x86_64:SLES12-SP3-LTSS-Updates
[...]


root # zypper rr 12 13

Tip: Upgrade Without Using SCC, SMT, or RMT
If your nodes are not subscribed to one of the supported software channel providers that handle automatic channel adjustment (such as SMT, RMT, or SCC), you may need to enable additional software modules and channels.

SUSE Enterprise Storage 5.5 requires:

SLES12-SP3-Installer-Updates

SLES12-SP3-Pool

SLES12-SP3-Updates

SUSE-Enterprise-Storage-5-Pool

SUSE-Enterprise-Storage-5-Updates

NFS/SMB Gateway on SLE-HA on SUSE Linux Enterprise Server 12 SP3 requires:

SLE-HA12-SP3-Pool

SLE-HA12-SP3-Updates

6.2.10.2 Repository Staging Systems

If you are using one of the repository staging systems (SMT or RMT), create a new frozen patch level for the current and the new SUSE Enterprise Storage version.

Find more information in:

https://documentation.suse.com/sles/12-SP5/single-html/SLES-smt/#book-smt

https://documentation.suse.com/sles/15-SP1/single-html/SLES-rmt/#book-rmt

https://documentation.suse.com/suma/3.2/


6.2.10.3 Patch the Whole Cluster to the Latest Patches

1. Apply the latest patches of SUSE Enterprise Storage 5.5 and SUSE Linux Enterprise Server 12 SP3 to each Ceph cluster node. Verify that the correct software repositories are connected to each cluster node (see Section 6.2.10.1, “Required Software Repositories”) and run DeepSea stage 0:

root@master # salt-run state.orch ceph.stage.0

2. After stage 0 completes, verify that each cluster node's status includes 'HEALTH_OK'. If not, resolve the problem before any possible reboots in the next steps.

3. Run zypper ps to check for processes that may still be running with outdated libraries or binaries, and reboot if there are any.
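For example, on each node:

root@minion > zypper ps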

4. Verify that the running kernel is the latest available, and reboot if not. Check the outputs of the following commands:

cephadm@adm > uname -a
cephadm@adm > rpm -qa kernel-default

5. Verify that the ceph package is version 12.2.12 or newer. Verify that the deepsea package is version 0.8.9 or newer.
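For example, you can query the installed versions with rpm; run the deepsea query on the Salt master and the ceph query on the cluster nodes:

root@master # rpm -q deepsea
root@minion > rpm -q ceph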

6. If you previously used any of the bluestore_cache settings, they are no longer effective from ceph version 12.2.10. The new setting bluestore_cache_autotune , which is set to 'true' by default, disables manual cache sizing. To turn on the old behavior, you need to set bluestore_cache_autotune=false . Refer to Book “Administration Guide”, Chapter 25 “Ceph Cluster Configuration”, Section 25.2.1 “Automatic Cache Sizing” for details.
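A minimal ceph.conf sketch for retaining the old behavior; the commented cache size line is optional and its value is a placeholder, not a recommendation:

[osd]
bluestore cache autotune = false
# optionally re-apply a manual cache size, for example:
# bluestore cache size = 3221225472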

6.2.11 Verify the Current Environment

If the system has obvious problems, fix them before starting the upgrade. Upgrading never fixes existing system problems.

Check cluster performance. You can use commands such as rados bench , ceph tell osd.* bench , or iperf3 .

Verify access to gateways (such as iSCSI Gateway or Object Gateway) and RADOS Block Device.


Document specific parts of the system setup, such as network setup, partitioning, or installation details.

Use supportconfig to collect important system information and save it outside the cluster nodes. Find more information in https://documentation.suse.com/sles/12-SP5/single-html/SLES-admin/#sec-admsupport-supportconfig .

Ensure there is enough free disk space on each cluster node. Check free disk space with df -h . When needed, free up additional disk space by removing unneeded files/directories or removing obsolete OS snapshots. If there is not enough free disk space, do not continue with the upgrade until you have freed enough disk space.

6.2.12 Check the Cluster's State

Check the cluster's health before starting the upgrade procedure. Do not start the upgrade unless the cluster reports 'HEALTH_OK'.

Verify that all services are running:

Salt master and Salt minion daemons.

Ceph Monitor and Ceph Manager daemons.

Metadata Server daemons.

Ceph OSD daemons.

Object Gateway daemons.

iSCSI Gateway daemons.

The following commands provide details of the cluster state and specific configuration:

ceph -s

Prints a brief summary of Ceph cluster health, running services, data usage, and I/O statistics. Verify that it reports 'HEALTH_OK' before starting the upgrade.

ceph health detail

Prints details if Ceph cluster health is not OK.

ceph versions


Prints versions of running Ceph daemons.

ceph df

Prints total and free disk space on the cluster. Do not start the upgrade if the cluster's free disk space is less than 25% of the total disk space.

salt '*' cephprocesses.check results=true

Prints running Ceph processes and their PIDs sorted by Salt minions.

ceph osd dump | grep ^flags

Verify that the 'recovery_deletes' and 'purged_snapdirs' flags are present. If not, you can force a scrub on all placement groups by running the following command. Be aware that this forced scrub may have a negative impact on your Ceph clients’ performance.

cephadm@adm > ceph pg dump pgs_brief | cut -d " " -f 1 | xargs -n1 ceph pg scrub

6.2.13 Migrate OSDs to BlueStore

OSD BlueStore is a new back-end for the OSD daemons. It is the default option since SUSE Enterprise Storage 5. Compared to FileStore, which stores objects as files in an XFS file system, BlueStore can deliver increased performance because it stores objects directly on the underlying block device. BlueStore also enables other features, such as built-in compression and EC overwrites, that are unavailable with FileStore.

Specifically for BlueStore, an OSD has a 'wal' (Write Ahead Log) device and a 'db' (RocksDB database) device. The RocksDB database holds the metadata for a BlueStore OSD. These two devices will reside on the same device as an OSD by default, but either can be placed on different, for example faster, media.

In SUSE Enterprise Storage 5, both FileStore and BlueStore are supported and it is possible for FileStore and BlueStore OSDs to co-exist in a single cluster. During the SUSE Enterprise Storage upgrade procedure, FileStore OSDs are not automatically converted to BlueStore.

Warning
Migration to BlueStore needs to be completed on all OSD nodes before the cluster upgrade because FileStore OSDs are not supported in SES 6.


Before converting to BlueStore, the OSDs need to be running SUSE Enterprise Storage 5. The conversion is a slow process as all data gets re-written twice. Though the migration process can take a long time to complete, there is no cluster outage and all clients can continue accessing the cluster during this period. However, do expect lower performance for the duration of the migration. This is caused by rebalancing and backfilling of cluster data.

Use the following procedure to migrate FileStore OSDs to BlueStore:

Tip: Turn Off Safety Measures
Salt commands needed for running the migration are blocked by safety measures. In order to turn these precautions off, run the following command:

root@master # salt-run disengage.safety

Rebuild the nodes before continuing:

root@master # salt-run rebuild.node TARGET

You can also choose to rebuild each node individually. For example:

root@master # salt-run rebuild.node data1.ceph

The rebuild.node runner always removes and recreates all OSDs on the node.

Important
If one OSD fails to convert, re-running the rebuild destroys the already-converted BlueStore OSDs. Instead of re-running the rebuild, you can run:

root@master # salt-run disks.deploy TARGET

After the migration to BlueStore, the object count will remain the same and disk usage will be nearly the same.


6.3 Order in Which Nodes Must Be Upgraded

Certain types of daemons depend upon others. For example, Ceph Object Gateways depend upon Ceph MON and OSD daemons. We recommend upgrading in this order:

1. Admin Node

2. Ceph Monitors/Ceph Managers

3. Metadata Servers

4. Ceph OSDs

5. Object Gateways

6. iSCSI Gateways

7. NFS Ganesha

8. Samba Gateways

6.4 Offline Upgrade of CTDB Clusters

CTDB provides a clustered database used by Samba Gateways. The CTDB protocol does not support clusters of nodes communicating with different protocol versions. Therefore, CTDB nodes need to be taken offline prior to performing a SUSE Enterprise Storage upgrade.

CTDB refuses to start if it is running alongside an incompatible version. For example, if you start a SUSE Enterprise Storage 6 CTDB version while SUSE Enterprise Storage 5.5 CTDB versions are running, then it will fail.

To take the CTDB offline, stop the SLE-HA cloned CTDB resource. For example:

root@master # crm resource stop cl-ctdb

This will stop the resource across all gateway nodes (assigned to the cloned resource). Verify that all the services are stopped by running the following command:

root@master # crm status


Note
Ensure CTDB is taken offline prior to the SUSE Enterprise Storage 5.5 to SUSE Enterprise Storage 6 upgrade of the CTDB and Samba Gateway packages. SLE-HA may also specify requirements for the upgrade of the underlying pacemaker/Linux-HA cluster; these should be tracked separately.

The SLE-HA cloned CTDB resource can be restarted once the new packages have been installed on all Samba Gateway nodes and the underlying pacemaker/Linux-HA cluster is up. To restart the CTDB resource, run the following command:

root@master # crm resource start cl-ctdb

6.5 Per-Node Upgrade Instructions

To ensure the core cluster services are available during the upgrade, you need to upgrade the cluster nodes sequentially, one by one. There are two ways you can perform the upgrade of a node: either using the installer DVD or using the distribution migration system.

After upgrading each node, we recommend running rpmconfigcheck to check for any updated configuration files that have been edited locally. If the command returns a list of file names with a suffix .rpmnew , .rpmorig , or .rpmsave , compare these files against the current configuration files to ensure that no local changes have been lost. If necessary, update the affected files. For more information on working with .rpmnew , .rpmorig , and .rpmsave files, refer to https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#sec-rpm-packages-manage .
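For example, after a node has been upgraded and rebooted:

root@minion > rpmconfigcheck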

Tip: Orphaned Packages
After a node is upgraded, a number of packages will be in an 'orphaned' state without a parent repository. This happens because python3-related packages do not make python2 packages obsolete.

Find more information about listing orphaned packages in https://documentation.suse.com/sles/12-SP5/single-html/SLES-admin/#sec-zypper-softup-orphaned .


6.5.1 Manual Node Upgrade Using the Installer DVD

1. Reboot the node from the SUSE Linux Enterprise Server 15 SP1 installer DVD/image.

2. On the YaST command line, add the option YAST_ACTIVATE_LUKS=0 . This option ensures that the system does not ask for a password for encrypted disks.

Warning
This must not be enabled by default as it would break full-disk encryption on the system disk or part of the system disk. This parameter only works if it is provided by the installer. If not provided, you will be prompted for an encryption password for each individual disk partition.

This is only supported since the 3rd Quarterly Update of SLES 15 SP1. You need SLE-15-SP1-Installer-DVD-*-QU3-DVD1.iso media or newer.

3. Select Upgrade from the boot menu.

4. On the Select the Migration Target screen, verify that 'SUSE Linux Enterprise Server 15 SP1' is selected and activate the Manually Adjust the Repositories for Migration check box.

FIGURE 6.1: SELECT THE MIGRATION TARGET


5. Select the following modules to install:

SUSE Enterprise Storage 6 x86_64

Basesystem Module 15 SP1 x86_64

Desktop Applications Module 15 SP1 x86_64

Legacy Module 15 SP1 x86_64

Server Applications Module 15 SP1 x86_64

6. On the Previously Used Repositories screen, verify that the correct repositories are selected. If the system is not registered with SCC/SMT, you need to add the repositories manually. SUSE Enterprise Storage 6 requires:

SLE-Module-Basesystem15-SP1-Pool

SLE-Module-Basesystem15-SP1-Updates

SLE-Module-Server-Applications15-SP1-Pool

SLE-Module-Server-Applications15-SP1-Updates

SLE-Module-Desktop-Applications15-SP1-Pool

SLE-Module-Desktop-Applications15-SP1-Updates

SLE-Product-SLES15-SP1-Pool

SLE-Product-SLES15-SP1-Updates

SLE15-SP1-Installer-Updates

SUSE-Enterprise-Storage-6-Pool

SUSE-Enterprise-Storage-6-Updates

If you intend to migrate ntpd to chronyd after SES migration (refer to Section 6.2.9, “Migrate from ntpd to chronyd”), include the following repositories:

SLE-Module-Legacy15-SP1-Pool

SLE-Module-Legacy15-SP1-Updates


NFS/SMB Gateway on SLE-HA on SUSE Linux Enterprise Server 15 SP1 requires:

SLE-Product-HA15-SP1-Pool

SLE-Product-HA15-SP1-Updates

7. Review the Installation Settings and start the installation procedure by clicking Update.

6.5.2 Node Upgrade Using the SUSE Distribution Migration System

The Distribution Migration System (DMS) provides an upgrade path for an installed SUSE Linux Enterprise system from one major version to another. The following procedure utilizes DMS to upgrade SUSE Enterprise Storage 5.5 to version 6, including the underlying SUSE Linux Enterprise Server 12 SP3 to SUSE Linux Enterprise Server 15 SP1 migration.

Refer to https://documentation.suse.com/suse-distribution-migration-system/1.0/single-html/distribution-migration-system/ to find both general and detailed information about DMS.

6.5.2.1 Before You Begin

Before starting the upgrade process, check whether the sles-ltss-release or sles-ltss-release-POOL packages are installed on any node of the cluster:

root@minion > rpm -q sles-ltss-release
root@minion > rpm -q sles-ltss-release-POOL

If either or both are installed, remove them:

root@minion > zypper rm -y sles-ltss-release sles-ltss-release-POOL

Important
This must be done on all nodes of the cluster before proceeding.

Note
Ensure you also follow the Section 6.2.12, “Check the Cluster's State” guidelines. The upgrade must not be started until all nodes are fully patched. See Section 6.2.10.3, “Patch the Whole Cluster to the Latest Patches” for more information.


6.5.2.2 Upgrading Nodes

1. Install the migration RPM packages. They adjust the GRUB boot loader to automatically trigger the upgrade on next reboot. Install the SLES15-SES-Migration and suse-migration-sle15-activation packages:

root@minion > zypper install SLES15-SES-Migration suse-migration-sle15-activation

2. a. If the node being upgraded is registered with a repository staging system such as SCC, SMT, RMT, or SUSE Manager, create /etc/sle-migration-service.yml with the following content:

use_zypper_migration: true
preserve:
  rules:
    - /etc/udev/rules.d/70-persistent-net.rules

b. If the node being upgraded is not registered with a repository staging system such as SCC, SMT, RMT, or SUSE Manager, perform the following changes:

i. Create the /etc/sle-migration-service.yml with the following content:

use_zypper_migration: false
preserve:
  rules:
    - /etc/udev/rules.d/70-persistent-net.rules

ii. Disable or remove the SLE 12 SP3 and SES 5 repos, and add the SLE 15 SP1 and SES 6 repos. Find the list of related repositories in Section 6.2.10.1, “Required Software Repositories”.

3. Reboot to start the upgrade. While the upgrade is running, you can log in to the upgraded node via ssh as the migration user using the existing SSH key from the host system, as described in https://documentation.suse.com/suse-distribution-migration-system/1.0/single-html/distribution-migration-system/ . For SUSE Enterprise Storage, if you have physical access or direct console access to the machine, you can also log in as root on the system console using the password sesupgrade . The node will reboot automatically after the upgrade.


Tip: Upgrade Failure
If the upgrade fails, inspect /var/log/distro_migration.log . Fix the problem, re-install the migration RPM packages, and reboot the node.

6.6 Upgrade the Admin Node

The following commands will still work, although Salt minions are running old versions of Ceph and Salt: salt '*' test.ping and ceph status

After the upgrade of the Admin Node, openATTIC will no longer be installed.

If the Admin Node hosted SMT, complete its migration to RMT (refer to https://documentation.suse.com/sles/15-SP1/single-html/SLES-rmt/#cha-rmt-migrate ).

Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

Tip: Status of Cluster Nodes
After the Admin Node is upgraded, you can run the salt-run upgrade.status command to view useful information about cluster nodes. The command lists the Ceph and OS versions of all nodes, and recommends the order in which to upgrade any nodes that are still running old versions.

root@master # salt-run upgrade.status
The newest installed software versions are:
  ceph: ceph version 14.2.1-468-g994fd9e0cc (994fd9e0ccc50c2f3a55a3b7a3d4e0ba74786d50) nautilus (stable)
  os: SUSE Linux Enterprise Server 15 SP1

Nodes running these software versions:
  admin.ceph (assigned roles: master)
  mon2.ceph (assigned roles: admin, mon, mgr)

Nodes running older software versions must be upgraded in the following order:
  1: mon1.ceph (assigned roles: admin, mon, mgr)
  2: mon3.ceph (assigned roles: admin, mon, mgr)
  3: data1.ceph (assigned roles: storage)
[...]


6.7 Upgrade Ceph Monitor/Ceph Manager Nodes

If your cluster does not use MDS roles, upgrade MON/MGR nodes one by one.

If your cluster uses MDS roles, and MON/MGR and MDS roles are co-located, you need to shrink the MDS cluster and then upgrade the co-located nodes. Refer to Section 6.8, “Upgrade Metadata Servers” for more details.

If your cluster uses MDS roles and they run on dedicated servers, upgrade all MON/MGR nodes one by one, then shrink the MDS cluster and upgrade it. Refer to Section 6.8, “Upgrade Metadata Servers” for more details.

Note: Ceph Monitor Upgrade
Due to a limitation in the Ceph Monitor design, once two MONs have been upgraded to SUSE Enterprise Storage 6 and have formed a quorum, the third MON (while still on SUSE Enterprise Storage 5.5) will not rejoin the MON cluster if it is restarted for any reason, including a node reboot. Therefore, when two MONs have been upgraded it is best to upgrade the rest as soon as possible.

Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

6.8 Upgrade Metadata Servers

You need to shrink the Metadata Server (MDS) cluster. Because of incompatible features between the SUSE Enterprise Storage 5.5 and 6 versions, the older MDS daemons will shut down as soon as they see a single SES 6 level MDS join the cluster. Therefore it is necessary to shrink the MDS cluster to a single active MDS (and no standbys) for the duration of the MDS node upgrades. As soon as the second node is upgraded, you can extend the MDS cluster again.

Tip
On a heavily loaded MDS cluster, you may need to reduce the load (for example by stopping clients) so that a single active MDS is able to handle the workload.


1. Note the current value of the max_mds option:

cephadm@adm > ceph fs get cephfs | grep max_mds

2. Shrink the MDS cluster if you have more than one active MDS daemon, that is, if max_mds is > 1. To shrink the MDS cluster, run:

cephadm@adm > ceph fs set FS_NAME max_mds 1

where FS_NAME is the name of your CephFS instance ('cephfs' by default).

3. Find the node hosting one of the standby MDS daemons. Consult the output of the ceph fs status command and start the upgrade of the MDS cluster on this node.

cephadm@adm > ceph fs status
cephfs - 2 clients
======
+------+--------+--------+---------------+-------+-------+
| Rank | State  |  MDS   |    Activity   |  dns  |  inos |
+------+--------+--------+---------------+-------+-------+
|  0   | active | mon1-6 | Reqs:    0 /s |   13  |   16  |
+------+--------+--------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata | 2688k | 96.8G |
|   cephfs_data   |   data   |    0  | 96.8G |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|    mon3-6   |
|    mon2-6   |
+-------------+

In this example, you need to start the upgrade procedure either on node 'mon3-6' or 'mon2-6'.

4. Upgrade the node with the standby MDS daemon. After the upgraded MDS node starts, the outdated MDS daemons will shut down automatically. At this point, clients may experience a short downtime of the CephFS service. Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

5. Upgrade the remaining MDS nodes.


6. Reset max_mds to the desired configuration:

cephadm@adm > ceph fs set FS_NAME max_mds ACTIVE_MDS_COUNT

6.9 Upgrade Ceph OSDs

For each storage node, follow these steps:

1. Identify which OSD daemons are running on a particular node:

cephadm@adm > ceph osd tree

2. Set the noout flag for each OSD daemon on the node that is being upgraded:

cephadm@adm > ceph osd add-noout osd.OSD_ID

For example:

cephadm@adm > for i in $(ceph osd ls-tree OSD_NODE_NAME);do echo "osd: $i"; ceph osd add-noout osd.$i; done

Verify with:

cephadm@adm > ceph health detail | grep noout

or

cephadm@adm > ceph -s
  cluster:
    id:     44442296-033b-3275-a803-345337dc53da
    health: HEALTH_WARN
            6 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

3. Create /etc/ceph/osd/*.json files for all existing OSDs by running the following command on the node that is going to be upgraded:

cephadm@osd > ceph-volume simple scan --force

4. Upgrade the OSD node. Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.


5. Activate all OSDs found in the system:

cephadm@osd > ceph-volume simple activate --all

Tip: Activating Data Partitions Individually
If you want to activate data partitions individually, you need to find the correct ceph-volume command for each partition to activate it. Replace X1 with the partition's correct letter/number:

cephadm@osd > ceph-volume simple scan /dev/sdX1

For example:

cephadm@osd > ceph-volume simple scan /dev/vdb1
[...]
--> OSD 8 got scanned and metadata persisted to file: /etc/ceph/osd/8-d7bd2685-5b92-4074-8161-30d146cd0290.json
--> To take over management of this scanned OSD, and disable ceph-disk and udev, run:
-->     ceph-volume simple activate 8 d7bd2685-5b92-4074-8161-30d146cd0290

The last line of the output contains the command to activate the partition:

cephadm@osd > ceph-volume simple activate 8 d7bd2685-5b92-4074-8161-30d146cd0290
[...]
--> All ceph-disk systemd units have been disabled to prevent OSDs getting triggered by UDEV events
[...]
Running command: /bin/systemctl start ceph-osd@8
--> Successfully activated OSD 8 with FSID d7bd2685-5b92-4074-8161-30d146cd0290

6. Verify that the OSD node will start properly after the reboot.

7. Address the 'Legacy BlueStore stats reporting detected on XX OSD(s)' message:

cephadm@adm > ceph -s
  cluster:
    id:     44442296-033b-3275-a803-345337dc53da
    health: HEALTH_WARN
            Legacy BlueStore stats reporting detected on 6 OSD(s)


The warning is normal when upgrading Ceph to 14.2.2. You can disable it by setting:

bluestore_warn_on_legacy_statfs = false
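As a sketch, on a Nautilus cluster the option can also be set centrally through the configuration store instead of editing ceph.conf:

cephadm@adm > ceph config set global bluestore_warn_on_legacy_statfs false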

The proper fix is to run the following command on all OSDs while they are stopped:

cephadm@osd > ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-XXX

Following is a helper script that runs the ceph-bluestore-tool repair for all OSDs on the OSD_NODE_NAME node:

cephadm@adm > OSDNODE=OSD_NODE_NAME;\
 for OSD in $(ceph osd ls-tree $OSDNODE);\
 do echo "osd=" $OSD;\
 salt $OSDNODE* cmd.run "systemctl stop ceph-osd@$OSD";\
 salt $OSDNODE* cmd.run "ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$OSD";\
 salt $OSDNODE* cmd.run "systemctl start ceph-osd@$OSD";\
 done

8. Unset the 'noout' flag for each OSD daemon on the node that is upgraded:

cephadm@adm > ceph osd rm-noout osd.OSD_ID

For example:

cephadm@adm > for i in $(ceph osd ls-tree OSD_NODE_NAME);do echo "osd: $i"; ceph osd rm-noout osd.$i; done

Verify with:

cephadm@adm > ceph health detail | grep noout

Note:

cephadm@adm > ceph -s
  cluster:
    id:     44442296-033b-3275-a803-345337dc53da
    health: HEALTH_WARN
            Legacy BlueStore stats reporting detected on 6 OSD(s)

9. Verify the cluster status. It will be similar to the following output:

cephadm@adm > ceph status


  cluster:
    id:     e0d53d64-6812-3dfe-8b72-fd454a6dcf12
    health: HEALTH_WARN
            3 monitors have not enabled msgr2

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 2h)
    mgr: mon2(active, since 22m), standbys: mon1, mon3
    osd: 30 osds: 30 up, 30 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 B
    usage:   31 GiB used, 566 GiB / 597 GiB avail
    pgs:     1024 active+clean

10. Once the last OSD node has been upgraded, issue the following command:

cephadm@adm > ceph osd require-osd-release nautilus

This disallows pre-SUSE Enterprise Storage 6 and pre-Nautilus OSDs and enables all new SUSE Enterprise Storage 6 and Nautilus-only OSD functionality.

11. Enable the new v2 network protocol by issuing the following command:

cephadm@adm > ceph mon enable-msgr2

This instructs all monitors that bind to the old default port for the legacy v1 Messenger protocol (6789) to also bind to the new v2 protocol port (3300). To see if all monitors have been updated, run:

cephadm@adm > ceph mon dump

Verify that each monitor has both a v2: and v1: address listed.
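Each monitor entry in the dump should then look similar to the following sketch (the address and monitor name are illustrative, not taken from this guide):

0: [v2:172.16.21.11:3300/0,v1:172.16.21.11:6789/0] mon.mon1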

12. Verify that all OSD nodes were rebooted and that OSDs started automatically after the reboot.

6.10 Upgrade Gateway Nodes

Upgrade gateway nodes in the following order:

1. Object Gateways


If the Object Gateways are fronted by a load balancer, then a rolling upgrade of the Object Gateways should be possible without an outage.

Validate that the Object Gateway daemons are running after each upgrade, and test with an S3/Swift client.

Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

2. iSCSI Gateways

Important: Package Dependency Conflict
After package dependencies are calculated, you need to resolve a package dependency conflict. It applies to the patterns-ses-ceph_iscsi version mismatch.

FIGURE 6.2: DEPENDENCY CONFLICT RESOLUTION

From the four presented solutions, choose deinstalling the patterns-ses-ceph_iscsi pattern. This way you will keep the required lrbd package installed.


If iSCSI initiators are configured with multipath, then a rolling upgrade of the iSCSI Gateways should be possible without an outage.

Validate that the lrbd daemon is running after each upgrade, and test with an initiator.

Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

3. NFS Ganesha. Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

4. Samba Gateways. Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

6.11 Steps to Take after the Last Node Has Been Upgraded

6.11.1 Update Ceph Monitor Setting

For each host that has been upgraded (OSD, MON, MGR, MDS, and Gateway nodes, as well as client hosts), update your ceph.conf file so that it either specifies no monitor port (if you are running the monitors on the default ports) or references both the v2 and v1 addresses and ports explicitly.

Note
Things will still work if only the v1 IP and port are listed, but each CLI instantiation or daemon will need to reconnect after learning that the monitors also speak the v2 protocol. This slows things down and prevents a full transition to the v2 protocol.
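A minimal sketch of an explicit dual-protocol entry in ceph.conf; the addresses are illustrative and must be replaced with your monitors' addresses:

[global]
mon host = [v2:172.16.21.11:3300,v1:172.16.21.11:6789], [v2:172.16.21.12:3300,v1:172.16.21.12:6789], [v2:172.16.21.13:3300,v1:172.16.21.13:6789]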

6.11.2 Enable the Telemetry Module

Finally, consider enabling the Telemetry module to send anonymized usage statistics and crash information to the upstream Ceph developers. To see what would be reported (without actually sending any information to anyone):

cephadm@adm > ceph mgr module enable telemetry


cephadm@adm > ceph telemetry show

If you are comfortable with the high-level cluster metadata that will be reported, you can opt-in to automatically report it:

cephadm@adm > ceph telemetry on

6.12 Update policy.cfg and Deploy Ceph Dashboard Using DeepSea

On the Admin Node, edit /srv/pillar/ceph/proposals/policy.cfg and apply the following changes:

Important: No New Services
During cluster upgrade, do not add new services to the policy.cfg file. Change the cluster architecture only after the upgrade is completed.

1. Remove role-openattic .

2. Add role-prometheus and role-grafana to the node that had Prometheus and Grafana installed, usually the Admin Node.

3. The role profile-PROFILE_NAME is now ignored. Add a new corresponding role-storage line. For example, for the existing

profile-default/cluster/*.sls

add

role-storage/cluster/*.sls

4. Synchronize all Salt modules:

root@master # salt '*' saltutil.sync_all

5. Update the Salt pillar by running DeepSea stage 1 and stage 2:

root@master # salt-run state.orch ceph.stage.1


root@master # salt-run state.orch ceph.stage.2

6. Clean up openATTIC:

root@master # salt OA_MINION state.apply ceph.rescind.openattic
root@master # salt OA_MINION state.apply ceph.remove.openattic

7. Unset the restart_igw grain to prevent stage 0 from restarting iSCSI Gateway, which is not installed yet:

root@master # salt '*' grains.delkey restart_igw

8. Finally, run through DeepSea stages 0-4:

root@master # salt-run state.orch ceph.stage.0
root@master # salt-run state.orch ceph.stage.1
root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3
root@master # salt-run state.orch ceph.stage.4

Tip: 'subvolume missing' Errors during Stage 3
DeepSea stage 3 may fail with an error similar to the following:

subvolume : ['/var/lib/ceph subvolume missing on 4510-2', \
'/var/lib/ceph subvolume missing on 4510-1', \
[...]
'See /srv/salt/ceph/subvolume/README.md']

In this case, you need to edit /srv/pillar/ceph/stack/global.yml and add the following line:

subvolume_init: disabled

Then refresh the Salt pillar and re-run DeepSea stage.3:

root@master # salt '*' saltutil.refresh_pillar
root@master # salt-run state.orch ceph.stage.3

After DeepSea has successfully finished stage.3, the Ceph Dashboard will be running. Refer to Book “Administration Guide” for a detailed overview of Ceph Dashboard features.


To list the nodes running the Ceph Dashboard, run:

cephadm@adm > ceph mgr services | grep dashboard

To list admin credentials, run:

root@master # salt-call grains.get dashboard_creds

9. Sequentially restart the Object Gateway services to use the 'beast' Web server instead of the outdated 'civetweb':

root@master # salt-run state.orch ceph.restart.rgw.force

10. Before you continue, we strongly recommend enabling the Ceph telemetry module. See Book “Administration Guide”, Chapter 21 “Ceph Manager Modules”, Section 21.2 “Telemetry Module” for information and instructions.

6.13 Migration from Profile-based Deployments to DriveGroups

In SUSE Enterprise Storage 5.5, DeepSea offered so-called 'profiles' to describe the layout of your OSDs. Starting with SUSE Enterprise Storage 6, we moved to a different approach called DriveGroups (find more details in Section 5.5.2, “DriveGroups”).

Note
Migrating to the new approach is not immediately mandatory. Destructive operations, such as salt-run osd.remove , salt-run osd.replace , or salt-run osd.purge are still available. However, adding new OSDs will require your action.

Because of the different approach of these implementations, we do not offer an automated migration path. However, we offer a variety of tools (Salt runners) to make the migration as simple as possible.


6.13.1 Analyze the Current Layout

To view information about the currently deployed OSDs, use the following command:

root@master # salt-run disks.discover

Alternatively, you can inspect the content of the files in the /srv/pillar/ceph/proposals/profile-*/ directories. They have a similar structure to the following:

ceph:
  storage:
    osds:
      /dev/disk/by-id/scsi-drive_name:
        format: bluestore
      /dev/disk/by-id/scsi-drive_name2:
        format: bluestore

6.13.2 Create DriveGroups Matching the Current Layout

Refer to Section 5.5.2.1, “Specification” for more details on DriveGroups specification.
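As an illustration only: a profile like the one in Section 6.13.1, which deploys stand-alone BlueStore OSDs, roughly corresponds to a DriveGroups specification of the following shape. The file location, the group name, and the size filter are assumptions to be verified against Section 5.5.2.1, “Specification”:

# /srv/salt/ceph/configuration/files/drive_groups.yml (assumed default location)
drive_group_default:
  target: '*'
  data_devices:
    size: '20GB:'    # hypothetical filter: all drives of 20 GB and larger become stand-alone OSDs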

The difference between a fresh deployment and an upgrade scenario is that the drives to be migrated are already 'used'. Because

root@master # salt-run disks.list

looks for unused disks only, use

root@master # salt-run disks.list include_unavailable=True

Adjust DriveGroups until you match your current setup. For a more visual representation of what will be happening, use the following command. Note that it has no output if there are no free disks:

root@master # salt-run disks.report bypass_pillar=True

If you verified that your DriveGroups are properly configured and want to apply the new approach, remove the files from the /srv/pillar/ceph/proposals/profile-PROFILE_NAME/ directory, remove the corresponding profile-PROFILE_NAME/cluster/*.sls lines from the /srv/pillar/ceph/proposals/policy.cfg file, and run DeepSea stage 2 to refresh the Salt pillar.

root@master # salt-run state.orch ceph.stage.2

Verify the result by running the following commands:

root@master # salt target_node pillar.get ceph:storage


root@master # salt-run disks.report

Warning: Incorrect DriveGroups Configuration
If your DriveGroups are not properly configured and there are spare disks in your setup, they will be deployed in the way you specified them. We recommend running:

root@master # salt-run disks.report

6.13.3 OSD Deployment

As of the Ceph Mimic release, the ceph-disk tool is deprecated, and as of the Ceph Nautilus release (SES 6) it is no longer shipped upstream.

ceph-disk is still supported in SUSE Enterprise Storage 6. Any pre-deployed ceph-disk OSDs will continue to function normally. However, when a disk breaks, there is no migration path: the disk will need to be re-deployed.

For completeness, consider migrating OSDs on the whole node. There are two paths for SUSE Enterprise Storage 6 users:

Keep OSDs deployed with ceph-disk : The ceph-volume simple command provides a way to take over the management while disabling the ceph-disk triggers.

Re-deploy existing OSDs with ceph-volume . For more information on replacing your OSDs, see Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.8 “Replacing an OSD Disk”.

Tip: Migrate to LVM Format
Whenever a single legacy OSD needs to be replaced on a node, all OSDs that share devices with it need to be migrated to the LVM-based format.

6.13.4 More Complex Setups

If you have a more sophisticated setup than just stand-alone OSDs, for example dedicated WAL/DBs or encrypted OSDs, the migration can only happen when all OSDs assigned to that WAL/DB device are removed. This is due to the ceph-volume command that creates Logical Volumes on disks before deployment. This prevents the user from mixing partition-based deployments with LV-based deployments. In such cases, it is best to manually remove all OSDs that are assigned to a WAL/DB device and re-deploy them using the DriveGroups approach.


7 Customizing the Default Configuration

You can change the default cluster configuration generated in Stage 2 (refer to DeepSea Stages Description). For example, you may need to change network settings, or software that is installed on the Admin Node by default. You can perform the former by modifying the pillar updated after Stage 2, while the latter is usually done by creating a custom sls file and adding it to the pillar. Details are described in the following sections.

7.1 Using Customized Configuration Files

This section lists several tasks that require adding/changing your own sls files. Such a procedure is typically used when you need to change the default deployment process.

Tip: Prefix Custom .sls Files
Your custom .sls files belong to the same subdirectory as DeepSea's .sls files. To prevent overwriting your .sls files with the possibly newly added ones from the DeepSea package, prefix their name with the custom- string.

7.1.1 Disabling a Deployment Step

If you address a specific task outside of the DeepSea deployment process and therefore need to skip it, create a 'no-operation' file following this example:

PROCEDURE 7.1: DISABLING TIME SYNCHRONIZATION

1. Create /srv/salt/ceph/time/disabled.sls with the following content and save it:

disable time setting:
  test.nop

2. Edit /srv/pillar/ceph/stack/global.yml , add the following line, and save it:

time_init: disabled

3. Verify by refreshing the pillar and running the step:

root@master # salt target saltutil.pillar_refresh
root@master # salt 'admin.ceph' state.apply ceph.time


admin.ceph:
  Name: disable time setting - Function: test.nop - Result: Clean

Summary for admin.ceph
------------
Succeeded: 1
Failed:    0
------------
Total states run: 1

Note: Unique ID
The task ID 'disable time setting' may be any message unique within an sls file. Prevent ID collisions by specifying unique descriptions.

7.1.2 Replacing a Deployment Step

If you need to replace the default behavior of a specific step with a custom one, create a custom sls file with replacement content.

By default, /srv/salt/ceph/pool/default.sls creates an rbd image called 'demo'. In our example, we do not want this image to be created, but we need two images: 'archive1' and 'archive2'.

PROCEDURE 7.2: REPLACING THE DEMO RBD IMAGE WITH TWO CUSTOM RBD IMAGES

1. Create /srv/salt/ceph/pool/custom.sls with the following content and save it:

wait:
  module.run:
    - name: wait.out
    - kwargs:
        'status': "HEALTH_ERR" 1
    - fire_event: True

archive1:
  cmd.run:
    - name: "rbd -p rbd create archive1 --size=1024" 2
    - unless: "rbd -p rbd ls | grep -q archive1$"
    - fire_event: True

archive2:
  cmd.run:
    - name: "rbd -p rbd create archive2 --size=768"
    - unless: "rbd -p rbd ls | grep -q archive2$"
    - fire_event: True

1 The wait module will pause until the Ceph cluster does not have a status of HEALTH_ERR . In fresh installations, a Ceph cluster may have this status until a sufficient number of OSDs become available and the creation of pools has completed.

2 The rbd command is not idempotent. If the same creation command is re-run after the image exists, the Salt state will fail. The unless statement prevents this.

2. To call the newly created custom file instead of the default, you need to edit /srv/pillar/ceph/stack/ceph/cluster.yml , add the following line, and save it:

pool_init: custom

3. Verify by refreshing the pillar and running the step:

root@master # salt target saltutil.pillar_refresh
root@master # salt 'admin.ceph' state.apply ceph.pool

Note: Authorization
The creation of pools or images requires sufficient authorization. The admin.ceph minion has an admin keyring.

Tip: Alternative Way
Another option is to change the variable in /srv/pillar/ceph/stack/ceph/roles/master.yml instead. Using this file will reduce the clutter of pillar data for other minions.

7.1.3 Modifying a Deployment Step

Sometimes you may need a specific step to do some additional tasks. We do not recommend modifying the related state file as it may complicate a future upgrade. Instead, create a separate file to carry out the additional tasks identical to what was described in Section 7.1.2, “Replacing a Deployment Step”.

Name the new sls file descriptively. For example, if you need to create two rbd images in addition to the demo image, name the file archive.sls .


PROCEDURE 7.3: CREATING TWO ADDITIONAL RBD IMAGES

1. Create /srv/salt/ceph/pool/custom.sls with the following content and save it:

include:
  - .archive
  - .default

Tip: Include Precedence
In this example, Salt will create the archive images and then create the demo image. The order does not matter in this example. To change the order, reverse the lines after the include: directive.

You can add the include line directly to archive.sls and all the images will get created as well. However, regardless of where the include line is placed, Salt processes the steps in the included file first. Although this behavior can be overridden with requires and order statements, a separate file that includes the others guarantees the order and reduces the chances of confusion.

2. Edit /srv/pillar/ceph/stack/ceph/cluster.yml , add the following line, and save it:

pool_init: custom

3. Verify by refreshing the pillar and running the step:

root@master # salt target saltutil.pillar_refresh
root@master # salt 'admin.ceph' state.apply ceph.pool

7.1.4 Modifying a Deployment Stage

If you need to add a completely separate deployment step, create three new files: an sls file that performs the command, an orchestration file, and a custom file which aligns the new step with the original deployment steps.

For example, if you need to run logrotate on all minions as part of the preparation stage:

First, create an sls file and include the logrotate command.

PROCEDURE 7.4: RUNNING logrotate ON ALL SALT MINIONS

1. Create a directory such as /srv/salt/ceph/logrotate .


2. Create /srv/salt/ceph/logrotate/init.sls with the following content and save it:

rotate logs:
  cmd.run:
    - name: "/usr/sbin/logrotate /etc/logrotate.conf"

3. Verify that the command works on a minion:

root@master # salt 'admin.ceph' state.apply ceph.logrotate

Because the orchestration file needs to run before all other preparation steps, add it to the Prep stage 0:

1. Create /srv/salt/ceph/stage/prep/logrotate.sls with the following content and save it:

logrotate:
  salt.state:
    - tgt: '*'
    - sls: ceph.logrotate

2. Verify that the orchestration file works:

root@master # salt-run state.orch ceph.stage.prep.logrotate

The last file is the custom one which includes the additional step with the original steps:

1. Create /srv/salt/ceph/stage/prep/custom.sls with the following content and save it:

include:
  - .logrotate
  - .master
  - .minion

2. Override the default behavior. Edit /srv/pillar/ceph/stack/global.yml , add the following line, and save the file:

stage_prep: custom

3. Verify that Stage 0 works:

root@master # salt-run state.orch ceph.stage.0


Note: Why global.yml?
The global.yml file is chosen over the cluster.yml because during the prep stage, no minion belongs to the Ceph cluster and has no access to any settings in cluster.yml .

7.1.5 Updates and Reboots during Stage 0

During stage 0 (refer to DeepSea Stages Description for more information on DeepSea stages), the Salt master and Salt minions may optionally reboot because newly updated packages, for example kernel , require rebooting the system.

The default behavior is to install available new updates and not reboot the nodes even in case of kernel updates.

You can change the default update/reboot behavior of DeepSea stage 0 by adding or changing the stage_prep_master and stage_prep_minion options in the /srv/pillar/ceph/stack/global.yml file. stage_prep_master sets the behavior of the Salt master, and stage_prep_minion sets the behavior of all minions. All available parameters are:

default

Install updates without rebooting.

default-update-reboot

Install updates and reboot after updating.

default-no-update-reboot

Reboot without installing updates.

default-no-update-no-reboot

Do not install updates or reboot.

For example, to prevent the cluster nodes from installing updates and rebooting, edit /srv/pillar/ceph/stack/global.yml and add the following lines:

stage_prep_master: default-no-update-no-reboot
stage_prep_minion: default-no-update-no-reboot


Tip: Values and Corresponding Files
The values of stage_prep_master correspond to file names located in /srv/salt/ceph/stage/0/master , while values of stage_prep_minion correspond to files in /srv/salt/ceph/stage/0/minion :

root@master # ls -l /srv/salt/ceph/stage/0/master
default-no-update-no-reboot.sls
default-no-update-reboot.sls
default-update-reboot.sls
[...]

root@master # ls -l /srv/salt/ceph/stage/0/minion
default-no-update-no-reboot.sls
default-no-update-reboot.sls
default-update-reboot.sls
[...]

7.2 Modifying Discovered Configuration

After you have completed Stage 2, you may want to change the discovered configuration. To view the current settings, run:

root@master # salt target pillar.items

The output of the default configuration for a single minion is usually similar to the following:

----------
available_roles:
    - admin
    - mon
    - storage
    - mds
    - igw
    - rgw
    - client-cephfs
    - client-radosgw
    - client-iscsi
    - mds-nfs
    - rgw-nfs
    - master


cluster: ceph
cluster_network: 172.16.22.0/24
fsid: e08ec63c-8268-3f04-bcdb-614921e94342
master_minion: admin.ceph
mon_host:
    - 172.16.21.13
    - 172.16.21.11
    - 172.16.21.12
mon_initial_members:
    - mon3
    - mon1
    - mon2
public_address: 172.16.21.11
public_network: 172.16.21.0/24
roles:
    - admin
    - mon
    - mds
time_server: admin.ceph
time_service: ntp

The above mentioned settings are distributed across several configuration files. The directory structure with these files is defined in the /srv/pillar/ceph/stack/stack.cfg file. The following files usually describe your cluster:

/srv/pillar/ceph/stack/global.yml - the file affects all minions in the Salt cluster.

/srv/pillar/ceph/stack/ceph/cluster.yml - the file affects all minions in the Ceph cluster called ceph .

/srv/pillar/ceph/stack/ceph/roles/role.yml - affects all minions that are assigned the specific role in the ceph cluster.

/srv/pillar/ceph/stack/ceph/minions/MINION_ID.yml - affects the individual minion.


Note: Overwriting Directories with Default Values
There is a parallel directory tree that stores the default configuration setup in /srv/pillar/ceph/stack/default . Do not change values here, as they are overwritten.

The typical procedure for changing the collected configuration is the following:

1. Find the location of the configuration item you need to change. For example, if you need to change a cluster-related setting such as the cluster network, edit the file /srv/pillar/ceph/stack/ceph/cluster.yml .

2. Save the file.

3. Verify the changes by running:

root@master # salt target saltutil.pillar_refresh

and then

root@master # salt target pillar.items

7.2.1 Enabling IPv6 for Ceph Cluster Deployment

Since IPv4 network addressing is prevalent, you need to enable IPv6 as a customization. DeepSea has no auto-discovery of IPv6 addressing.

To configure IPv6, set the public_network and cluster_network variables in the /srv/pillar/ceph/stack/global.yml file to valid IPv6 subnets. For example:

public_network: fd00:10::/64
cluster_network: fd00:11::/64

Then run DeepSea stage 2 and verify that the network information matches the setting. Stage 3 will generate the ceph.conf with the necessary flags.

Important: No Support for Dual Stack
Ceph does not support dual stack; running Ceph simultaneously on IPv4 and IPv6 is not possible. DeepSea validation will reject a mismatch between public_network and cluster_network or within either variable. The following example will fail the validation.


public_network: "192.168.10.0/24 fd00:10::/64"

Tip: Avoid Using fe80::/10 link-local Addresses
Avoid using fe80::/10 link-local addresses. All network interfaces have an assigned fe80 address and require an interface qualifier for proper routing. Either assign IPv6 addresses allocated to your site or consider using fd00::/8 . These are part of ULA and not globally routable.


III Installation of Additional Services

8 Installation of Services to Access your Data

9 Ceph Object Gateway

10 Installation of iSCSI Gateway

11 Installation of CephFS

12 Installation of NFS Ganesha


8 Installation of Services to Access your Data

After you deploy your SUSE Enterprise Storage 6 cluster you may need to install additional software for accessing your data, such as the Object Gateway or the iSCSI Gateway, or you can deploy a clustered file system on top of the Ceph cluster. This chapter mainly focuses on manual installation. If you have a cluster deployed using Salt, refer to Chapter 5, Deploying with DeepSea/Salt for a procedure on installing particular gateways or the CephFS.


9 Ceph Object Gateway

Ceph Object Gateway is an object storage interface built on top of librgw to provide applications with a RESTful gateway to Ceph clusters. It supports two interfaces:

S3-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.

Swift-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.

The Object Gateway daemon uses the 'Beast' HTTP front-end by default. It uses the Boost.Beast library for HTTP parsing and the Boost.Asio library for asynchronous network I/O operations.

Because Object Gateway provides interfaces compatible with OpenStack Swift and Amazon S3, the Object Gateway has its own user management. Object Gateway can store data in the same cluster that is used to store data from CephFS clients or RADOS Block Device clients. The S3 and Swift APIs share a common name space, so you may write data with one API and retrieve it with the other.

Important: Object Gateway Deployed by DeepSea
Object Gateway is installed as a DeepSea role, therefore you do not need to install it manually.

To install the Object Gateway during the cluster deployment, see Section 5.3, “Cluster Deployment”.

To add a new node with Object Gateway to the cluster, see Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.2 “Adding New Roles to Nodes”.

9.1 Object Gateway Manual Installation

1. Install Object Gateway on a node that is not using port 80. The following command installs all required components:

cephadm@ogw > sudo zypper ref && sudo zypper in ceph-radosgw


2. If the Apache server from the previous Object Gateway instance is running, stop it and disable the relevant service:

cephadm@ogw > sudo systemctl stop apache2.service
cephadm@ogw > sudo systemctl disable apache2.service

3. Edit /etc/ceph/ceph.conf and add the following lines:

[client.rgw.gateway_host]
rgw frontends = "beast port=80"

Tip
If you want to configure Object Gateway/Beast for use with SSL encryption, modify the line accordingly:

rgw frontends = beast ssl_port=7480 ssl_certificate=PATH_TO_CERTIFICATE.PEM

4. Restart the Object Gateway service.

cephadm@ogw > sudo systemctl restart ceph-radosgw@rgw.gateway_host

9.1.1 Object Gateway Configuration

Several steps are required to configure an Object Gateway.

9.1.1.1 Basic Configuration

Configuring a Ceph Object Gateway requires a running Ceph Storage Cluster. The Ceph Object Gateway is a client of the Ceph Storage Cluster. As a Ceph Storage Cluster client, it requires:

A host name for the gateway instance, for example gateway .

A storage cluster user name with appropriate permissions and a keyring.

Pools to store its data.

A data directory for the gateway instance.

An instance entry in the Ceph configuration file.


Each instance must have a user name and key to communicate with a Ceph storage cluster. In the following steps, we use a monitor node to create a bootstrap keyring, then create the Object Gateway instance user keyring based on the bootstrap one. Then, we create a client user name and key. Next, we add the key to the Ceph Storage Cluster. Finally, we distribute the keyring to the node containing the gateway instance.

1. Create a keyring for the gateway:

cephadm@adm > ceph-authtool --create-keyring /etc/ceph/ceph.client.rgw.keyring
cephadm@adm > sudo chmod +r /etc/ceph/ceph.client.rgw.keyring

2. Generate a Ceph Object Gateway user name and key for each instance. As an example, we will use the name gateway after client.rgw :

cephadm@adm > ceph-authtool /etc/ceph/ceph.client.rgw.keyring \
  -n client.rgw.gateway --gen-key

3. Add capabilities to the key:

cephadm@adm > ceph-authtool -n client.rgw.gateway --cap osd 'allow rwx' \
  --cap mon 'allow rwx' /etc/ceph/ceph.client.rgw.keyring

4. Once you have created a keyring and key to enable the Ceph Object Gateway with access to the Ceph Storage Cluster, add the key to your Ceph Storage Cluster. For example:

cephadm@adm > ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.rgw.gateway \
  -i /etc/ceph/ceph.client.rgw.keyring

5. Distribute the keyring to the node with the gateway instance:

cephadm@adm > scp /etc/ceph/ceph.client.rgw.keyring ceph@HOST_NAME:/home/ceph
cephadm@adm > ssh ceph@HOST_NAME
cephadm@ogw > mv ceph.client.rgw.keyring /etc/ceph/ceph.client.rgw.keyring

Tip: Use Bootstrap Keyring
An alternative way is to create the Object Gateway bootstrap keyring, and then create the Object Gateway keyring from it:

1. Create an Object Gateway bootstrap keyring on one of the monitor nodes:

cephadm@mon > ceph \
 auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' \
 --connect-timeout=25 \
 --cluster=ceph \
 --name mon. \
 --keyring=/var/lib/ceph/mon/ceph-NODE_HOST/keyring \
 -o /var/lib/ceph/bootstrap-rgw/keyring

2. Create the /var/lib/ceph/radosgw/ceph-RGW_NAME directory for storing the bootstrap keyring:

cephadm@mon > mkdir \
/var/lib/ceph/radosgw/ceph-RGW_NAME

3. Create an Object Gateway keyring from the newly created bootstrap keyring:

cephadm@mon > ceph \
 auth get-or-create client.rgw.RGW_NAME osd 'allow rwx' mon 'allow rw' \
 --connect-timeout=25 \
 --cluster=ceph \
 --name client.bootstrap-rgw \
 --keyring=/var/lib/ceph/bootstrap-rgw/keyring \
 -o /var/lib/ceph/radosgw/ceph-RGW_NAME/keyring

4. Copy the Object Gateway keyring to the Object Gateway host:

cephadm@mon > scp \
/var/lib/ceph/radosgw/ceph-RGW_NAME/keyring \
RGW_HOST:/var/lib/ceph/radosgw/ceph-RGW_NAME/keyring

9.1.1.2 Create Pools (Optional)

Ceph Object Gateways require Ceph Storage Cluster pools to store specific gateway data. If the user you created has proper permissions, the gateway will create the pools automatically. However, ensure that you have set an appropriate default number of placement groups per pool in the Ceph configuration file.
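For example, a minimal ceph.conf sketch with illustrative values; adjust the placement group numbers to the size of your cluster:

[global]
osd pool default pg num = 32
osd pool default pgp num = 32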

The pool names follow the ZONE_NAME.POOL_NAME syntax. When configuring a gateway with the default region and zone, the default zone name is 'default' as in our example:

.rgw.root
default.rgw.control
default.rgw.meta


default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

To create the pools manually, see Book “Administration Guide”, Chapter 22 “Managing Storage Pools”, Section 22.2.2 “Create a Pool”.

Important: Object Gateway and Erasure-Coded Pools
Only the default.rgw.buckets.data pool can be erasure coded. All other pools need to be replicated, otherwise the gateway is not accessible.
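If you create the data pool manually as an erasure-coded pool, a minimal sketch could look as follows; the placement group counts are illustrative and the default zone is assumed:

cephadm@adm > ceph osd pool create default.rgw.buckets.data 64 64 erasure
cephadm@adm > ceph osd pool application enable default.rgw.buckets.data rgw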

9.1.1.3 Adding Gateway Configuration to Ceph

Add the Ceph Object Gateway configuration to the Ceph configuration file. The Ceph Object Gateway configuration requires you to identify the Ceph Object Gateway instance. Then, specify the host name where you installed the Ceph Object Gateway daemon, a keyring (for use with cephx), and optionally a log file. For example:

[client.rgw.INSTANCE_NAME]
host = HOST_NAME
keyring = /etc/ceph/ceph.client.rgw.keyring

Tip: Object Gateway Log File
To override the default Object Gateway log file, include the following:

log file = /var/log/radosgw/client.rgw.INSTANCE_NAME.log

The [client.rgw.*] portion of the gateway instance identifies this portion of the Ceph configuration file as configuring a Ceph Storage Cluster client where the client type is a Ceph Object Gateway (radosgw). The instance name follows. For example:

[client.rgw.gateway]
host = ceph-gateway
keyring = /etc/ceph/ceph.client.rgw.keyring


Note
The HOST_NAME must be your machine host name, excluding the domain name.

Then turn off print continue . If you have it set to true, you may encounter problems with PUT operations:

rgw print continue = false

To use a Ceph Object Gateway with subdomain S3 calls (for example http://bucketname.hostname ), you must add the Ceph Object Gateway DNS name under the [client.rgw.gateway] section of the Ceph configuration file:

[client.rgw.gateway]
...
rgw dns name = HOST_NAME

You should also consider installing a DNS server such as Dnsmasq on your client machine(s) when using the http://BUCKET_NAME.HOST_NAME syntax. The dnsmasq.conf file should include the following settings:

address=/HOST_NAME/HOST_IP_ADDRESS
listen-address=CLIENT_LOOPBACK_IP

Then, add the CLIENT_LOOPBACK_IP IP address as the first DNS server on the client machine(s).
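For example, the client's resolver configuration (typically /etc/resolv.conf ) would then begin with a line such as:

nameserver CLIENT_LOOPBACK_IP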

9.1.1.4 Create Data Directory

Deployment scripts may not create the default Ceph Object Gateway data directory. Create data directories for each instance of a radosgw daemon if not already done. The host variables in the Ceph configuration file determine which host runs each instance of a radosgw daemon. The typical form specifies the radosgw daemon, the cluster name, and the daemon ID.

root # mkdir -p /var/lib/ceph/radosgw/CLUSTER_ID

Using the example ceph.conf settings above, you would execute the following:

root # mkdir -p /var/lib/ceph/radosgw/ceph-radosgw.gateway


9.1.1.5 Restart Services and Start the Gateway

To ensure that all components have reloaded their configurations, we recommend restarting your Ceph Storage Cluster service. Then, start up the radosgw service. For more information, see Book “Administration Guide”, Chapter 15 “Introduction” and Book “Administration Guide”, Chapter 26 “Ceph Object Gateway”, Section 26.3 “Operating the Object Gateway Service”.

When the service is up and running, you can make an anonymous GET request to see if the gateway returns a response. A simple HTTP request to the domain name should return the following:

<ListAllMyBucketsResult>
  <Owner>
    <ID>anonymous</ID>
    <DisplayName/>
  </Owner>
  <Buckets/>
</ListAllMyBucketsResult>
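For example, assuming curl is installed on the client, such an anonymous request could look as follows (HOST_NAME being the gateway's host name configured above):

cephadm@adm > curl http://HOST_NAME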


10 Installation of iSCSI Gateway

iSCSI is a storage area network (SAN) protocol that allows clients (called initiators) to send SCSI commands to SCSI storage devices (targets) on remote servers. SUSE Enterprise Storage 6 includes a facility that opens Ceph storage management to heterogeneous clients, such as Microsoft Windows* and VMware* vSphere, through the iSCSI protocol. Multipath iSCSI access enables availability and scalability for these clients, and the standardized iSCSI protocol also provides an additional layer of security isolation between clients and the SUSE Enterprise Storage 6 cluster. The configuration facility is named ceph-iscsi . Using ceph-iscsi , Ceph storage administrators can define thin-provisioned, replicated, highly-available volumes supporting read-only snapshots, read-write clones, and automatic resizing with Ceph RADOS Block Device (RBD). Administrators can then export volumes either via a single ceph-iscsi gateway host, or via multiple gateway hosts supporting multipath failover. Linux, Microsoft Windows, and VMware hosts can connect to volumes using the iSCSI protocol, which makes them available like any other SCSI block device. This means SUSE Enterprise Storage 6 customers can effectively run a complete block-storage infrastructure subsystem on Ceph that provides all the features and benefits of a conventional SAN, enabling future growth.

This chapter introduces detailed information to set up a Ceph cluster infrastructure together with an iSCSI gateway so that the client hosts can use remotely stored data as local storage devices using the iSCSI protocol.

10.1 iSCSI Block Storage

iSCSI is an implementation of the Small Computer System Interface (SCSI) command set using the Internet Protocol (IP), specified in RFC 3720. iSCSI is implemented as a service where a client (the initiator) talks to a server (the target) via a session on TCP port 3260. An iSCSI target's IP address and port are called an iSCSI portal, where a target can be exposed through one or more portals. The combination of a target and one or more portals is called the target portal group (TPG).

The underlying data link layer protocol for iSCSI is commonly Ethernet. More specifically, modern iSCSI infrastructures use 10 Gigabit Ethernet or faster networks for optimal throughput. 10 Gigabit Ethernet connectivity between the iSCSI gateway and the back-end Ceph cluster is strongly recommended.


10.1.1 The Linux Kernel iSCSI Target

The Linux kernel iSCSI target was originally named LIO for linux-iscsi.org, the project's original domain and Web site. For some time, no fewer than four competing iSCSI target implementations were available for the Linux platform, but LIO ultimately prevailed as the single iSCSI reference target. The mainline kernel code for LIO uses the simple, but somewhat ambiguous name "target", distinguishing between "target core" and a variety of front-end and back-end target modules.

The most commonly used front-end module is arguably iSCSI. However, LIO also supports Fibre Channel (FC), Fibre Channel over Ethernet (FCoE) and several other front-end protocols. At this time, only the iSCSI protocol is supported by SUSE Enterprise Storage.

The most frequently used target back-end module is one that is capable of simply re-exporting any available block device on the target host. This module is named iblock. However, LIO also has an RBD-specific back-end module supporting parallelized multipath I/O access to RBD images.

10.1.2 iSCSI Initiators

This section introduces brief information on iSCSI initiators used on Linux, Microsoft Windows, and VMware platforms.

10.1.2.1 Linux

The standard initiator for the Linux platform is open-iscsi . open-iscsi launches a daemon, iscsid , which the user can then use to discover iSCSI targets on any given portal, log in to targets, and map iSCSI volumes. iscsid communicates with the SCSI mid layer to create in-kernel block devices that the kernel can then treat like any other SCSI block device on the system. The open-iscsi initiator can be deployed in conjunction with the Device Mapper Multipath ( dm-multipath ) facility to provide a highly available iSCSI block device.
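For example, a minimal discovery and login sequence with iscsiadm could look as follows; the portal address is illustrative:

root # iscsiadm -m discovery -t sendtargets -p 192.168.124.104
root # iscsiadm -m node -p 192.168.124.104 --login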

10.1.2.2 Microsoft Windows and Hyper-V

The default iSCSI initiator for the Microsoft Windows operating system is the Microsoft iSCSI initiator. The iSCSI service can be configured via a graphical user interface (GUI), and supports multipath I/O for high availability.


10.1.2.3 VMware

The default iSCSI initiator for VMware vSphere and ESX is the VMware ESX software iSCSI initiator, vmkiscsi . When enabled, it can be configured either from the vSphere client, or using the vmkiscsi-tool command. You can then format storage volumes connected through the vSphere iSCSI storage adapter with VMFS, and use them like any other VM storage device. The VMware initiator also supports multipath I/O for high availability.

10.2 General Information about ceph-iscsi

ceph-iscsi combines the benefits of RADOS Block Devices with the ubiquitous versatility of iSCSI. By employing ceph-iscsi on an iSCSI target host (known as the iSCSI Gateway), any application that needs to make use of block storage can benefit from Ceph, even if it does not speak any Ceph client protocol. Instead, users can use iSCSI or any other target front-end protocol to connect to an LIO target, which translates all target I/O to RBD storage operations.

FIGURE 10.1: CEPH CLUSTER WITH A SINGLE ISCSI GATEWAY

ceph-iscsi is inherently highly-available and supports multipath operations. Thus, downstream initiator hosts can use multiple iSCSI gateways for both high availability and scalability. When communicating with an iSCSI configuration with more than one gateway, initiators may load-balance iSCSI requests across multiple gateways. In the event of a gateway failing, being temporarily unreachable, or being disabled for maintenance, I/O will transparently continue via another gateway.


FIGURE 10.2: CEPH CLUSTER WITH MULTIPLE ISCSI GATEWAYS

10.3 Deployment Considerations

A minimum configuration of SUSE Enterprise Storage 6 with ceph-iscsi consists of the following components:

A Ceph storage cluster. The Ceph cluster consists of a minimum of four physical servers hosting at least eight object storage daemons (OSDs) each. In such a configuration, three OSD nodes also double as a monitor (MON) host.

An iSCSI target server running the LIO iSCSI target, configured via ceph-iscsi .

An iSCSI initiator host, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.


A recommended production configuration of SUSE Enterprise Storage 6 with ceph-iscsi consists of:

A Ceph storage cluster. A production Ceph cluster consists of any number of (typically more than 10) OSD nodes, each typically running 10-12 object storage daemons (OSDs), with no fewer than three dedicated MON hosts.

Several iSCSI target servers running the LIO iSCSI target, configured via ceph-iscsi . For iSCSI fail-over and load-balancing, these servers must run a kernel supporting the target_core_rbd module. Update packages are available from the SUSE Linux Enterprise Server maintenance channel.

Any number of iSCSI initiator hosts, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

10.4 Installation and Configuration

This section describes steps to install and configure an iSCSI Gateway on top of SUSE Enterprise Storage.

10.4.1 Deploy the iSCSI Gateway to a Ceph Cluster

You can deploy the iSCSI Gateway either during the Ceph cluster deployment process, or add it to an existing cluster using DeepSea.

To include the iSCSI Gateway during the cluster deployment process, refer to Section 5.5.1.2, “Role Assignment”.

To add the iSCSI Gateway to an existing cluster, refer to Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.2 “Adding New Roles to Nodes”.

10.4.2 Create RBD Images

RBD images are created in the Ceph store and subsequently exported to iSCSI. We recommend that you use a dedicated RADOS pool for this purpose. You can create a volume from any host that is able to connect to your storage cluster using the Ceph rbd command line utility. This requires the client to have at least a minimal ceph.conf configuration file, and appropriate CephX authentication credentials.
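If the dedicated pool does not exist yet, you can create and initialize it first. A minimal sketch, assuming the pool name 'iscsi-images' and an illustrative number of placement groups:

cephadm@adm > ceph osd pool create iscsi-images 32 32
cephadm@adm > rbd pool init iscsi-images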


To create a new volume for subsequent export via iSCSI, use the rbd create command, specifying the volume size in megabytes. For example, in order to create a 100 GB volume named 'testvol' in the pool named 'iscsi-images', run:

cephadm@adm > rbd --pool iscsi-images create --size=102400 'testvol'

10.4.3 Export RBD Images via iSCSI

To export RBD images via iSCSI, you can use either the Ceph Dashboard Web interface or the ceph-iscsi gwcli utility. In this section we will focus on gwcli only, demonstrating how to create an iSCSI target that exports an RBD image using the command line.

Note
Only the following RBD image features are supported: layering , striping (v2) , exclusive-lock , fast-diff , and data-pool . RBD images with any other feature enabled cannot be exported.

As root , start the iSCSI gateway command line interface:

root # gwcli

Go to iscsi-targets and create a target with the name iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol :

gwcli > /> cd /iscsi-targets
gwcli > /iscsi-targets> create iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol

Create the iSCSI gateways by specifying the gateway name and IP address:

gwcli > /iscsi-targets> cd iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/gateways
gwcli > /iscsi-target...tvol/gateways> create iscsi1 192.168.124.104
gwcli > /iscsi-target...tvol/gateways> create iscsi2 192.168.124.105

Tip
Use the help command to show the list of available commands in the current configuration node.


Add the RBD image with the name 'testvol' in the pool 'iscsi-images':

gwcli > /iscsi-target...tvol/gateways> cd /disks
gwcli > /disks> attach iscsi-images/testvol

Map the RBD image to the target:

gwcli > /disks> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/disks
gwcli > /iscsi-target...testvol/disks> add iscsi-images/testvol

Note
You can use lower level tools, such as targetcli , to query the local configuration, but not to modify it.

Tip
You can use the ls command to review the configuration. Some configuration nodes also support the info command, which can be used to display more detailed information.

Note that, by default, ACL authentication is enabled, so this target is not accessible yet. Check Section 10.4.4, “Authentication and Access Control” for more information about authentication and access control.

10.4.4 Authentication and Access Control

iSCSI authentication is flexible and covers many authentication possibilities.

10.4.4.1 No Authentication

'No authentication' means that any initiator will be able to access any LUNs on the corresponding target. You can enable 'No authentication' by disabling the ACL authentication:

gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/hosts
gwcli > /iscsi-target...testvol/hosts> auth disable_acl


10.4.4.2 ACL Authentication

When using initiator-name-based ACL authentication, only the defined initiators are allowed to connect. You can define an initiator as follows:

gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/hosts
gwcli > /iscsi-target...testvol/hosts> create iqn.1996-04.de.suse:01:e6ca28cc9f20

Defined initiators will be able to connect, but will only have access to the RBD images that were explicitly added to the initiator:

gwcli > /iscsi-target...:e6ca28cc9f20> disk add rbd/testvol

10.4.4.3 CHAP Authentication

In addition to the ACL, you can enable CHAP authentication by specifying a user name and password for each initiator:

gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/hosts/iqn.1996-04.de.suse:01:e6ca28cc9f20
gwcli > /iscsi-target...:e6ca28cc9f20> auth username=common12 password=pass12345678

Note
User names must have a length of 8 to 64 characters and can contain alphanumeric characters, '.', '@', '-', '_' or ':'.

Passwords must have a length of 12 to 16 characters and can contain alphanumeric characters, '@', '-', '_' or '/'.

Optionally, you can also enable CHAP mutual authentication by specifying the mutual_username and mutual_password parameters in the auth command.
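For example, a sketch with illustrative credentials that respect the length rules above:

gwcli > /iscsi-target...:e6ca28cc9f20> auth username=common12 password=pass12345678 mutual_username=mutual12345 mutual_password=mutualpass123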

10.4.4.4 Discovery and Mutual Authentication

Discovery authentication is independent of the previous authentication methods. It is optional, requires credentials for browsing, and can be configured as follows:

gwcli > /> cd /iscsi-targets
gwcli > /iscsi-targets> discovery_auth username=du123456 password=dp1234567890


Note
User names must have a length of 8 to 64 characters and can only contain letters, '.', '@', '-', '_' or ':'.

Passwords must have a length of 12 to 16 characters and can only contain letters, '@', '-', '_' or '/'.

Optionally, you can also specify the mutual_username and mutual_password parameters in the discovery_auth command.

Discovery authentication can be disabled by using the following command:

gwcli > /iscsi-targets> discovery_auth nochap

10.4.5 Advanced Settings

ceph-iscsi can be configured with advanced parameters which are subsequently passed on to the LIO I/O target. The parameters are divided up into 'target' and 'disk' parameters.

Warning
Unless otherwise noted, changing these parameters from the default setting is not recommended.

10.4.5.1 Target Settings

You can view the value of these settings by using the info command:

gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol
gwcli > /iscsi-target...i.SYSTEM-ARCH:testvol> info

And change a setting using the reconfigure command:

gwcli > /iscsi-target...i.SYSTEM-ARCH:testvol> reconfigure login_timeout 20

The available 'target' settings are:

default_cmdsn_depth


Default CmdSN (Command Sequence Number) depth. Limits the number of requests that an iSCSI initiator can have outstanding at any moment.

default_erl

Default error recovery level.

login_timeout

Login timeout value in seconds.

netif_timeout

NIC failure timeout in seconds.

prod_mode_write_protect

If set to 1, prevents writes to LUNs.

10.4.5.2 Disk Settings

You can view the value of these settings by using the info command:

gwcli > /> cd /disks/rbd/testvol
gwcli > /disks/rbd/testvol> info

And change a setting using the reconfigure command:

gwcli > /disks/rbd/testvol> reconfigure rbd/testvol emulate_pr 0

The available 'disk' settings are:

block_size

Block size of the underlying device.

emulate_3pc

If set to 1, enables Third Party Copy.

emulate_caw

If set to 1, enables Compare and Write.

emulate_dpo

If set to 1, turns on Disable Page Out.

emulate_fua_read

If set to 1, enables Force Unit Access read.


emulate_fua_write

If set to 1, enables Force Unit Access write.

emulate_model_alias

If set to 1, uses the back-end device name for the model alias.

emulate_pr

If set to 0, support for SCSI Reservations, including Persistent Group Reservations, is disabled. While disabled, the SES iSCSI Gateway can ignore reservation state, resulting in improved request latency.

Tip
Setting backstore_emulate_pr to 0 is recommended if iSCSI initiators do not require SCSI Reservation support.

emulate_rest_reord

If set to 0, the Queue Algorithm Modifier has Restricted Reordering.

emulate_tas

If set to 1, enables Task Aborted Status.

emulate_tpu

If set to 1, enables Thin Provisioning Unmap.

emulate_tpws

If set to 1, enables Thin Provisioning Write Same.

emulate_ua_intlck_ctrl

If set to 1, enables Unit Attention Interlock.

emulate_write_cache

If set to 1, turns on Write Cache Enable.

enforce_pr_isids

If set to 1, enforces persistent reservation ISIDs.

is_nonrot

If set to 1, the backstore is a non-rotational device.

max_unmap_block_desc_count


Maximum number of block descriptors for UNMAP.

max_unmap_lba_count

Maximum number of LBAs for UNMAP.

max_write_same_len

Maximum length for WRITE_SAME.

optimal_sectors

Optimal request size in sectors.

pi_prot_type

DIF protection type.

queue_depth

Queue depth.

unmap_granularity

UNMAP granularity.

unmap_granularity_alignment

UNMAP granularity alignment.

force_pr_aptpl

When enabled, LIO will always write out the persistent reservation state to persistent storage, regardless of whether or not the client has requested it via aptpl=1 . This has no effect with the kernel RBD back-end for LIO, which always persists PR state. Ideally, the target_core_rbd option should force it to '1' and throw an error if someone tries to disable it via configfs.

unmap_zeroes_data

Affects whether LIO will advertise LBPRZ to SCSI initiators, indicating that zeros will be read back from a region following UNMAP or WRITE SAME with an unmap bit.

10.5 Exporting RADOS Block Device Images Using tcmu-runner

ceph-iscsi supports both rbd (kernel-based) and user:rbd (tcmu-runner) backstores, making all the management transparent and independent of the backstore.


Warning: Technology Preview
tcmu-runner based iSCSI Gateway deployments are currently a technology preview.

Unlike kernel-based iSCSI Gateway deployments, tcmu-runner based iSCSI Gateways do not offer support for multipath I/O or SCSI Persistent Reservations.

To export a RADOS Block Device image using tcmu-runner , all you need to do is specify the user:rbd backstore when attaching the disk:

gwcli > /disks> attach rbd/testvol backstore=user:rbd

Note
When using tcmu-runner , the exported RBD image must have the exclusive-lock feature enabled.
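If the feature is not enabled yet on an existing image, you can enable it first; a minimal sketch assuming the 'iscsi-images/testvol' image from the earlier examples:

cephadm@adm > rbd feature enable iscsi-images/testvol exclusive-lock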


11 Installation of CephFS

The Ceph file system (CephFS) is a POSIX-compliant file system that uses a Ceph storage cluster to store its data. CephFS uses the same cluster system as Ceph block devices, Ceph object storage with its S3 and Swift APIs, or native bindings ( librados ).

To use CephFS, you need to have a running Ceph storage cluster, and at least one running Ceph metadata server.

11.1 Supported CephFS Scenarios and Guidance

With SUSE Enterprise Storage 6, SUSE introduces official support for many scenarios in which the scale-out and distributed component CephFS is used. This entry describes hard limits and provides guidance for the suggested use cases.

A supported CephFS deployment must meet these requirements:

Clients are SUSE Linux Enterprise Server 12 SP3 or newer, or SUSE Linux Enterprise Server 15 or newer, using the cephfs kernel module driver. The FUSE module is not supported.

CephFS quotas are supported in SUSE Enterprise Storage 6 and can be set on any subdirectory of the Ceph file system. The quota restricts either the number of bytes or files stored beneath the specified point in the directory hierarchy. For more information, see Book “Administration Guide”, Chapter 28 “Clustered File System”, Section 28.6 “Setting CephFS Quotas”.

CephFS supports file layout changes as documented in Section 11.3.4, “File Layouts”. However, while the file system is mounted by any client, new data pools may not be added to an existing CephFS file system ( ceph mds add_data_pool ). They may only be added while the file system is unmounted.

A minimum of one Metadata Server. SUSE recommends deploying several nodes with the MDS role. By default, additional MDS daemons start as standby daemons, acting as backups for the active MDS. Multiple active MDS daemons are also supported (refer to Section 11.3.2, “MDS Cluster Size”).


11.2 Ceph Metadata Server

Ceph metadata server (MDS) stores metadata for the CephFS. Ceph block devices and Ceph object storage do not use MDS. MDSs make it possible for POSIX file system users to execute basic commands, such as ls or find, without placing an enormous burden on the Ceph storage cluster.

11.2.1 Adding and Removing a Metadata Server

You can deploy MDS either during the initial cluster deployment process as described in Section 5.3, “Cluster Deployment”, or add it to an already deployed cluster as described in Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.1 “Adding New Cluster Nodes”.

After you deploy your MDS, allow the Ceph OSD/MDS service in the firewall setting of the server where MDS is deployed: start yast , navigate to Security and Users > Firewall > Allowed Services, and in the Service to Allow drop-down menu select Ceph OSD/MDS. If the Ceph MDS node is not allowed full traffic, mounting of a file system fails, even though other operations may work properly.

You can remove a metadata server in your cluster as described in .

11.2.2 Configuring a Metadata Server

You can fine-tune the MDS behavior by inserting relevant options in the ceph.conf configuration file.
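For example, a minimal sketch with illustrative values that raises the cache memory limit and the beacon grace period (both options are described below):

[mds]
mds cache memory limit = 2147483648
mds beacon grace = 30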

METADATA SERVER SETTINGS

mon force standby active

If set to 'true' (default), monitors force standby-replay to be active. Set under the [mon] or [global] sections.

mds cache memory limit

The soft memory limit (in bytes) that the MDS will enforce for its cache. Administrators should use this instead of the old mds cache size setting. Defaults to 1 GB.

mds cache reservation

The cache reservation (memory or inodes) for the MDS cache to maintain. When the MDS begins touching its reservation, it will recall client state until its cache size shrinks to restore the reservation. Defaults to 0.05.


mds cache size

The number of inodes to cache. A value of 0 (default) indicates an unlimited number. It is recommended to use mds cache memory limit to limit the amount of memory the MDS cache uses.

mds cache mid

The insertion point for new items in the cache LRU (from the top). Default is 0.7.

mds dir commit ratio

The fraction of a directory that is dirty before Ceph commits using a full update instead of a partial update. Default is 0.5.

mds dir max commit size

The maximum size of a directory update before Ceph breaks it into smaller transactions. Default is 90 MB.

mds decay halflife

The half-life of MDS cache temperature. Default is 5.

mds beacon interval

The frequency in seconds of beacon messages sent to the monitor. Default is 4.

mds beacon grace

The interval without beacons before Ceph declares an MDS laggy and possibly replaces it. Default is 15.

mds blacklist interval

The blacklist duration for failed MDSs in the OSD map. This setting controls how long failed MDS daemons will stay in the OSD map blacklist. It has no effect on how long something is blacklisted when the administrator blacklists it manually. For example, the ceph osd blacklist add command will still use the default blacklist time. Default is 24 * 60.

mds reconnect timeout

The interval in seconds to wait for clients to reconnect during MDS restart. Default is 45.

mds tick interval

How frequently the MDS performs internal periodic tasks. Default is 5.

mds dirstat min interval

The minimum interval in seconds to try to avoid propagating recursive stats up the tree. Default is 1.


mds scatter nudge interval

How quickly dirstat changes propagate up. Default is 5.

mds client prealloc inos

The number of inode numbers to preallocate per client session. Default is 1000.

mds early reply

Determines whether the MDS should allow clients to see request results before they commit to the journal. Default is 'true'.

mds use tmap

Use trivial map for directory updates. Default is 'true'.

mds default dir hash

The function to use for hashing files across directory fragments. Default is 2 (that is 'rjenkins').

mds log skip corrupt events

Determines whether the MDS should try to skip corrupt journal events during journal replay. Default is 'false'.

mds log max events

The maximum events in the journal before we initiate trimming. Set to -1 (default) to disable limits.

mds log max segments

The maximum number of segments (objects) in the journal before we initiate trimming. Set to -1 to disable limits. Default is 30.

mds log max expiring

The maximum number of segments to expire in parallel. Default is 20.

mds log eopen size

The maximum number of inodes in an EOpen event. Default is 100.

mds bal sample interval

Determines how frequently to sample directory temperature for fragmentation decisions. Default is 3.

mds bal replicate threshold

The maximum temperature before Ceph attempts to replicate metadata to other nodes. Default is 8000.


mds bal unreplicate threshold

The minimum temperature before Ceph stops replicating metadata to other nodes. Default is 0.

mds bal split size

The maximum directory size before the MDS will split a directory fragment into smaller bits. Default is 10000.

mds bal split rd

The maximum directory read temperature before Ceph splits a directory fragment. Default is 25000.

mds bal split wr

The maximum directory write temperature before Ceph splits a directory fragment. Default is 10000.

mds bal split bits

The number of bits by which to split a directory fragment. Default is 3.

mds bal merge size

The minimum directory size before Ceph tries to merge adjacent directory fragments. Default is 50.

mds bal interval

The frequency in seconds of workload exchanges between MDSs. Default is 10.

mds bal fragment interval

The delay in seconds between a fragment being capable of splitting or merging, and execution of the fragmentation change. Default is 5.

mds bal fragment fast factor

The ratio by which fragments may exceed the split size before a split is executed immediately, skipping the fragment interval. Default is 1.5.

mds bal fragment size max

The maximum size of a fragment before any new entries are rejected with ENOSPC. Default is 100000.

mds bal idle threshold

The minimum temperature before Ceph migrates a subtree back to its parent. Default is 0.

mds bal mode


The method for calculating MDS load:

0 = Hybrid.

1 = Request rate and latency.

2 = CPU load.

Default is 0.

mds bal min rebalance

The minimum subtree temperature before Ceph migrates. Default is 0.1.

mds bal min start

The minimum subtree temperature before Ceph searches a subtree. Default is 0.2.

mds bal need min

The minimum fraction of target subtree size to accept. Default is 0.8.

mds bal need max

The maximum fraction of target subtree size to accept. Default is 1.2.

mds bal midchunk

Ceph will migrate any subtree that is larger than this fraction of the target subtree size. Default is 0.3.

mds bal minchunk

Ceph will ignore any subtree that is smaller than this fraction of the target subtree size. Default is 0.001.

mds bal target removal min

The minimum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 5.

mds bal target removal max

The maximum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 10.

mds replay interval

The journal poll interval when in standby-replay mode ('hot standby'). Default is 1.

mds shutdown check


The interval for polling the cache during MDS shutdown. Default is 0.

mds thrash fragments

Ceph will randomly fragment or merge directories. Default is 0.

mds dump cache on map

Ceph will dump the MDS cache contents to a file on each MDS map. Default is 'false'.

mds dump cache after rejoin

Ceph will dump MDS cache contents to a file after rejoining the cache during recovery. Default is 'false'.

mds standby for name

An MDS daemon will standby for another MDS daemon of the name specified in this setting.

mds standby for rank

An MDS daemon will standby for an MDS daemon of this rank. Default is -1.

mds standby replay

Determines whether a Ceph MDS daemon should poll and replay the log of an active MDS ('hot standby'). Default is 'false'.

mds min caps per client

Set the minimum number of capabilities a client may hold. Default is 100.

mds max ratio caps per client

Set the maximum ratio of current caps that may be recalled during MDS cache pressure. Default is 0.8.

METADATA SERVER JOURNALER SETTINGS

journaler write head interval

How frequently to update the journal head object. Default is 15.

journaler prefetch periods

How many stripe periods to read ahead on journal replay. Default is 10.

journal prezero periods

How many stripe periods to zero ahead of the write position. Default is 10.

journaler batch interval

Maximum additional latency in seconds we incur artificially. Default is 0.001.


journaler batch max

Maximum number of bytes by which we will delay flushing. Default is 0.

11.3 CephFS

When you have a healthy Ceph storage cluster with at least one Ceph metadata server, you can create and mount your Ceph file system. Ensure that your client has network connectivity and a proper authentication keyring.

11.3.1 Creating CephFS

A CephFS requires at least two RADOS pools: one for data and one for metadata. When configuring these pools, you might consider:

Using a higher replication level for the metadata pool, as any data loss in this pool can render the whole file system inaccessible.

Using lower-latency storage such as SSDs for the metadata pool, as this will improve the observed latency of file system operations on clients.

When assigning a role-mds in the policy.cfg , the required pools are automatically created. You can manually create the pools cephfs_data and cephfs_metadata for manual performance tuning before setting up the Metadata Server. DeepSea will not create these pools if they already exist.

For more information on managing pools, see Book “Administration Guide”, Chapter 22 “Managing Storage Pools”.

To create the two required pools, for example 'cephfs_data' and 'cephfs_metadata', with default settings for use with CephFS, run the following commands:

cephadm@adm > ceph osd pool create cephfs_data pg_num
cephadm@adm > ceph osd pool create cephfs_metadata pg_num

It is possible to use EC pools instead of replicated pools. We recommend using EC pools only for low performance requirements and infrequent random access, for example cold storage, backups, or archiving. CephFS on EC pools requires BlueStore to be enabled and the pool must have the allow_ec_overwrites option set. This option can be set by running ceph osd pool set ec_pool allow_ec_overwrites true .
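For example, a minimal sketch for creating such an erasure-coded data pool; the pool name and placement group counts are illustrative:

cephadm@adm > ceph osd pool create cephfs_data_ec 32 32 erasure
cephadm@adm > ceph osd pool set cephfs_data_ec allow_ec_overwrites true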


Erasure coding adds significant overhead to file system operations, especially small updates. This overhead is inherent to using erasure coding as a fault tolerance mechanism. This penalty is the trade-off for significantly reduced storage space overhead.

When the pools are created, you may enable the file system with the ceph fs new command:

cephadm@adm > ceph fs new fs_name metadata_pool_name data_pool_name

For example:

cephadm@adm > ceph fs new cephfs cephfs_metadata cephfs_data

You can check that the file system was created by listing all available CephFSs:

cephadm@adm > ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

When the file system has been created, your MDS will be able to enter an active state. For example, in a single MDS system:

cephadm@adm > ceph mds stat
e5: 1/1/1 up

Tip: More Topics
You can find more information on specific tasks, for example mounting, unmounting, and advanced CephFS setup, in Book “Administration Guide”, Chapter 28 “Clustered File System”.

11.3.2 MDS Cluster Size

A CephFS instance can be served by multiple active MDS daemons. All active MDS daemons that are assigned to a CephFS instance will distribute the file system's directory tree between themselves, and thus spread the load of concurrent clients. In order to add an active MDS daemon to a CephFS instance, a spare standby is needed. Either start an additional daemon or use an existing standby instance.

The following command will display the current number of active and passive MDS daemons.

cephadm@adm > ceph mds stat

The following command sets the number of active MDSs to two in a file system instance.


cephadm@adm > ceph fs set fs_name max_mds 2

In order to shrink the MDS cluster prior to an update, two steps are necessary. First, set max_mds so that only one instance remains:

cephadm@adm > ceph fs set fs_name max_mds 1

and after that, explicitly deactivate the other active MDS daemons:

cephadm@adm > ceph mds deactivate fs_name:rank

where rank is the number of an active MDS daemon of a file system instance, ranging from 0 to max_mds -1.

We recommend that at least one MDS is left as a standby daemon.

11.3.3 MDS Cluster and Updates

During Ceph updates, the feature flags on a file system instance may change (usually by adding new features). Incompatible daemons (such as the older versions) are not able to function with an incompatible feature set and will refuse to start. This means that updating and restarting one daemon can cause all other not yet updated daemons to stop and refuse to start. For this reason, we recommend shrinking the active MDS cluster to size one and stopping all standby daemons before updating Ceph. The manual steps for this update procedure are as follows:

1. Update the Ceph-related packages using zypper .

2. Shrink the active MDS cluster as described above to one instance and stop all standby MDS daemons using their systemd units on all other nodes:

cephadm@mds > systemctl stop ceph-mds\*.service ceph-mds.target

3. Only then restart the single remaining MDS daemon, causing it to restart using the updated binary.

cephadm@mds > systemctl restart ceph-mds\*.service ceph-mds.target

4. Restart all other MDS daemons and reset the desired max_mds setting.

cephadm@mds > systemctl start ceph-mds.target


If you use DeepSea, it will follow this procedure in case the ceph package was updated during stages 0 and 4. It is possible to perform this procedure while clients have the CephFS instance mounted and I/O is ongoing. Note however that there will be a very brief I/O pause while the active MDS restarts. Clients will recover automatically.

It is good practice to reduce the I/O load as much as possible before updating an MDS cluster. An idle MDS cluster will go through this update procedure quicker. Conversely, on a heavily loaded cluster with multiple MDS daemons it is essential to reduce the load in advance to prevent a single MDS daemon from being overwhelmed by ongoing I/O.

11.3.4 File Layouts

The layout of a file controls how its contents are mapped to Ceph RADOS objects. You can read and write a file’s layout using virtual extended attributes, or xattrs for short.

The name of the layout xattrs depends on whether a file is a regular file or a directory. Regular files’ layout xattrs are called ceph.file.layout , while directories’ layout xattrs are called ceph.dir.layout . Where examples refer to ceph.file.layout , substitute the .dir. part as appropriate when dealing with directories.

11.3.4.1 Layout Fields

The following attribute fields are recognized:

pool

ID or name of a RADOS pool in which a file’s data objects will be stored.

pool_namespace

RADOS namespace within a data pool to which the objects will be written. It is empty by default, meaning the default namespace.

stripe_unit

The size in bytes of a block of data used in the RAID 0 distribution of a file. All stripe units for a file have equal size. The last stripe unit is typically incomplete—it represents the data at the end of the file as well as the unused 'space' beyond it up to the end of the fixed stripe unit size.

stripe_count

The number of consecutive stripe units that constitute a RAID 0 'stripe' of file data.


object_size

The size in bytes of RADOS objects into which the file data is chunked.

Tip: Object Sizes

RADOS enforces a configurable limit on object sizes. If you increase CephFS object sizes beyond that limit, then writes may not succeed. The OSD setting is osd_max_object_size , which is 128 MB by default. Very large RADOS objects may prevent smooth operation of the cluster, so increasing the object size limit past the default is not recommended.
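As a sketch, the currently configured limit can be inspected with the ceph config command:

cephadm@adm > ceph config get osd osd_max_object_size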

11.3.4.2 Reading Layout with getfattr

Use the getfattr command to read the layout information of an example file named file as a single string:

root # touch file
root # getfattr -n ceph.file.layout file
# file: file
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

Read individual layout fields:

root # getfattr -n ceph.file.layout.pool file
# file: file
ceph.file.layout.pool="cephfs_data"
root # getfattr -n ceph.file.layout.stripe_unit file
# file: file
ceph.file.layout.stripe_unit="4194304"

Tip: Pool ID or Name

When reading layouts, the pool will usually be indicated by name. However, in rare cases when pools have only just been created, the ID may be output instead.

Directories do not have an explicit layout until it is customized. Attempts to read the layout will fail if it has never been modified: this indicates that the layout of the next ancestor directory with an explicit layout will be used.

root # mkdir dir


root # getfattr -n ceph.dir.layout dir
dir: ceph.dir.layout: No such attribute
root # setfattr -n ceph.dir.layout.stripe_count -v 2 dir
root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"

11.3.4.3 Writing Layouts with setfattr

Use the setfattr command to modify the layout fields of an example file named file :

cephadm@adm > ceph osd lspools
0 rbd
1 cephfs_data
2 cephfs_metadata
root # setfattr -n ceph.file.layout.stripe_unit -v 1048576 file
root # setfattr -n ceph.file.layout.stripe_count -v 8 file
# Setting pool by ID:
root # setfattr -n ceph.file.layout.pool -v 1 file
# Setting pool by name:
root # setfattr -n ceph.file.layout.pool -v cephfs_data file

Note: Empty File

When the layout fields of a file are modified using setfattr , this file needs to be empty, otherwise an error will occur.

11.3.4.4 Clearing Layouts

If you want to remove an explicit layout from an example directory mydir and revert back to inheriting the layout of its ancestor, run the following:

root # setfattr -x ceph.dir.layout mydir

Similarly, if you have set the 'pool_namespace' attribute and wish to modify the layout to use the default namespace instead, run:

# Create a directory and set a namespace on it
root # mkdir mydir
root # setfattr -n ceph.dir.layout.pool_namespace -v foons mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \


pool=cephfs_data_a pool_namespace=foons"

# Clear the namespace from the directory's layout
root # setfattr -x ceph.dir.layout.pool_namespace mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
 pool=cephfs_data_a"

11.3.4.5 Inheritance of Layouts

Files inherit the layout of their parent directory at creation time. However, subsequent changes to the parent directory’s layout do not affect children:

root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# file1 inherits its parent's layout
root # touch dir/file1
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# update the layout of the directory before creating a second file
root # setfattr -n ceph.dir.layout.stripe_count -v 4 dir
root # touch dir/file2

# file1's layout is unchanged
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# ...while file2 has the parent directory's new layout
root # getfattr -n ceph.file.layout dir/file2
# file: dir/file2
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"

Files created as descendants of the directory also inherit its layout if the intermediate directories do not have layouts set:

root # getfattr -n ceph.dir.layout dir


# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"
root # mkdir dir/childdir
root # getfattr -n ceph.dir.layout dir/childdir
dir/childdir: ceph.dir.layout: No such attribute
root # touch dir/childdir/grandchild
root # getfattr -n ceph.file.layout dir/childdir/grandchild
# file: dir/childdir/grandchild
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"

11.3.4.6 Adding a Data Pool to the Metadata Server

Before you can use a pool with CephFS, you need to add it to the Metadata Server:

cephadm@adm > ceph fs add_data_pool cephfs cephfs_data_ssd
cephadm@adm > ceph fs ls  # Pool should now show up.
... data pools: [cephfs_data cephfs_data_ssd ]

Tip: cephx Keys

Make sure that your cephx keys allow the client to access this new pool.
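A minimal sketch of such a capability update, assuming a hypothetical client.cephfs key and the pool names used above:

cephadm@adm > ceph auth caps client.cephfs mon 'allow r' mds 'allow rw' \
 osd 'allow rw pool=cephfs_data, allow rw pool=cephfs_data_ssd'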

You can then update the layout on a directory in CephFS to use the pool you added:

root # mkdir /mnt/cephfs/myssddir
root # setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/myssddir

All new files created within that directory will now inherit its layout and place their data in your newly added pool. You may notice that the number of objects in your primary data pool continues to increase, even if files are being created in the pool you newly added. This is normal: the file data is stored in the pool specified by the layout, but a small amount of metadata is kept in the primary data pool for all files.
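As a quick check, you can, for example, write a test file into the directory and list the objects that land in the new pool; the file name is illustrative:

root # dd if=/dev/zero of=/mnt/cephfs/myssddir/testfile bs=1M count=4
root # rados -p cephfs_data_ssd ls | head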


12 Installation of NFS Ganesha

NFS Ganesha provides NFS access to either the Object Gateway or the CephFS. In SUSE Enterprise Storage 6, NFS versions 3 and 4 are supported. NFS Ganesha runs in the user space instead of the kernel space and directly interacts with the Object Gateway or CephFS.

Warning: Cross Protocol Access

Native CephFS and NFS clients are not restricted by file locks obtained via Samba, and vice versa. Applications that rely on cross protocol file locking may experience data corruption if CephFS backed Samba share paths are accessed via other means.

12.1 Preparation

12.1.1 General Information

To successfully deploy NFS Ganesha, you need to add a role-ganesha to your /srv/pillar/ceph/proposals/policy.cfg . For details, see Section 5.5.1, “The policy.cfg File”. NFS Ganesha also needs either a role-rgw or a role-mds present in the policy.cfg .

Although it is possible to install and run the NFS Ganesha server on an already existing Ceph node, we recommend running it on a dedicated host with access to the Ceph cluster. The client hosts are typically not part of the cluster, but they need to have network access to the NFS Ganesha server.

To enable the NFS Ganesha server at any point after the initial installation, add the role-ganesha to the policy.cfg and re-run at least DeepSea stages 2 and 4. For details, see Section 5.3, “Cluster Deployment”.

NFS Ganesha is configured via the file /etc/ganesha/ganesha.conf that exists on the NFS Ganesha node. However, this file is overwritten each time DeepSea stage 4 is executed. Therefore, we recommend editing the template used by Salt, which is the file /srv/salt/ceph/ganesha/files/ganesha.conf.j2 on the Salt master. For details about the configuration file, see Book “Administration Guide”, Chapter 30 “NFS Ganesha: Export Ceph Data via NFS”, Section 30.2 “Configuration”.


12.1.2 Summary of Requirements

The following requirements need to be met before DeepSea stages 2 and 4 can be executed to install NFS Ganesha:

At least one node needs to be assigned the role-ganesha .

You can define only one role-ganesha per minion.

NFS Ganesha needs either an Object Gateway or CephFS to work.

The kernel-based NFS needs to be disabled on minions with the role-ganesha role, for example as shown below.
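As a sketch, on SUSE Linux Enterprise the kernel NFS server can typically be stopped and disabled via systemd; the unit name may differ depending on the installed packages:

root # systemctl disable --now nfs-server.service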

12.2 Example Installation

This procedure provides an example installation that uses both the Object Gateway and CephFS File System Abstraction Layers (FSAL) of NFS Ganesha.

1. If you have not done so, execute DeepSea stages 0 and 1 before continuing with this procedure.

root@master # salt-run state.orch ceph.stage.0
root@master # salt-run state.orch ceph.stage.1

2. After having executed stage 1 of DeepSea, edit the /srv/pillar/ceph/proposals/policy.cfg and add the line

role-ganesha/cluster/NODENAME

Replace NODENAME with the name of a node in your cluster. Also make sure that a role-mds and a role-rgw are assigned.

3. Execute at least stages 2 and 4 of DeepSea. Running stage 3 in between is recommended.

root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3 # optional but recommended
root@master # salt-run state.orch ceph.stage.4

4. Verify that NFS Ganesha is working by checking that the NFS Ganesha service is running on the minion node:

root@master # salt -I roles:ganesha service.status nfs-ganesha


MINION_ID: True

12.3 High Availability Active-Passive Configuration

This section provides an example of how to set up a two-node active-passive configuration of NFS Ganesha servers. The setup requires the SUSE Linux Enterprise High Availability Extension. The two nodes are called earth and mars .

Important: Co-location of Services

Services that have their own fault tolerance and their own load balancing should not be running on cluster nodes that get fenced for failover services. Therefore, do not run Ceph Monitor, Metadata Server, iSCSI, or Ceph OSD services on High Availability setups.

For details about SUSE Linux Enterprise High Availability Extension, see https://documentation.suse.com/sle-ha/15-SP1/ .

12.3.1 Basic Installation

In this setup earth has the IP address 192.168.1.1 and mars has the address 192.168.1.2 .

Additionally, two floating virtual IP addresses are used, allowing clients to connect to the service independent of which physical node it is running on. 192.168.1.10 is used for cluster administration with Hawk2 and 192.168.2.1 is used exclusively for the NFS exports. This makes it easier to apply security restrictions later.

The following procedure describes the example installation. More details can be found at https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-install-quick/ .

1. Prepare the NFS Ganesha nodes on the Salt master:

a. Run DeepSea stages 0 and 1.

root@master # salt-run state.orch ceph.stage.0
root@master # salt-run state.orch ceph.stage.1


b. Assign the nodes earth and mars the role-ganesha in the /srv/pillar/ceph/proposals/policy.cfg :

role-ganesha/cluster/earth*.sls
role-ganesha/cluster/mars*.sls

c. Run DeepSea stages 2 to 4.

root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3
root@master # salt-run state.orch ceph.stage.4

2. Register the SUSE Linux Enterprise High Availability Extension on earth and mars .

root # SUSEConnect -r ACTIVATION_CODE -e E_MAIL

3. Install ha-cluster-bootstrap on both nodes:

root # zypper in ha-cluster-bootstrap

4. a. Initialize the cluster on earth :

root@earth # ha-cluster-init

b. Let mars join the cluster:

root@mars # ha-cluster-join -c earth

5. Check the status of the cluster. You should see two nodes added to the cluster:

root@earth # crm status

6. On both nodes, disable the automatic start of the NFS Ganesha service at boot time:

root # systemctl disable nfs-ganesha

7. Start the crm shell on earth :

root@earth # crm configure

The next commands are executed in the crm shell.


8. On earth , run the crm shell to execute the following commands to configure the resource for NFS Ganesha daemons as a clone of the systemd resource type:

crm(live)configure# primitive nfs-ganesha-server systemd:nfs-ganesha \
op monitor interval=30s
crm(live)configure# clone nfs-ganesha-clone nfs-ganesha-server meta interleave=true
crm(live)configure# commit
crm(live)configure# status
 2 nodes configured
 2 resources configured

Online: [ earth mars ]

Full list of resources:
 Clone Set: nfs-ganesha-clone [nfs-ganesha-server]
     Started: [ earth mars ]

9. Create a primitive IPaddr2 with the crm shell:

crm(live)configure# primitive ganesha-ip IPaddr2 \
params ip=192.168.2.1 cidr_netmask=24 nic=eth0 \
op monitor interval=10 timeout=20

crm(live)# status
Online: [ earth mars ]
Full list of resources:
 Clone Set: nfs-ganesha-clone [nfs-ganesha-server]
     Started: [ earth mars ]
 ganesha-ip (ocf::heartbeat:IPaddr2): Started earth

10. To set up a relationship between the NFS Ganesha server and the floating Virtual IP, we use colocation and ordering.

crm(live)configure# colocation ganesha-ip-with-nfs-ganesha-server inf: ganesha-ip nfs-ganesha-clone
crm(live)configure# order ganesha-ip-after-nfs-ganesha-server Mandatory: nfs-ganesha-clone ganesha-ip

11. Use the mount command from the client to ensure that cluster setup is complete:

root # mount -t nfs -v -o sync,nfsvers=4 192.168.2.1:/ /mnt


12.3.2 Clean Up Resources

In the event of an NFS Ganesha failure at one of the nodes, for example earth , fix the issue and clean up the resource. Only after the resource is cleaned up can the resource fail back to earth in case NFS Ganesha fails at mars .

To clean up the resource:

root@earth # crm resource cleanup nfs-ganesha-clone earth
root@earth # crm resource cleanup ganesha-ip earth

12.3.3 Setting Up Ping Resource

It may happen that the server is unable to reach the client because of a network issue. A ping resource can detect and mitigate this problem. Configuring this resource is optional.

1. Define the ping resource:

crm(live)configure# primitive ganesha-ping ocf:pacemaker:ping \
 params name=ping dampen=3s multiplier=100 host_list="CLIENT1 CLIENT2" \
 op monitor interval=60 timeout=60 \
 op start interval=0 timeout=60 \
 op stop interval=0 timeout=60

host_list is a list of IP addresses separated by space characters. The IP addresses will be pinged regularly to check for network outages. If a client must always have access to the NFS server, add it to host_list .

2. Create a clone:

crm(live)configure# clone ganesha-ping-clone ganesha-ping \
 meta interleave=true

3. The following command creates a constraint for the NFS Ganesha service. It forces the service to move to another node when host_list is unreachable.

crm(live)configure# location nfs-ganesha-server-with-ganesha-ping nfs-ganesha-clone \
 rule -inf: not_defined ping or ping lte 0


12.3.4 Setting Up PortBlock Resource

When a service goes down, the TCP connection that is in use by NFS Ganesha needs to be closed, otherwise it continues to run until a system-specific timeout occurs. This timeout can take upwards of 3 minutes.

To shorten this timeout, the TCP connection needs to be reset. We recommend configuring portblock to reset stale TCP connections.

You can choose to use portblock with or without the tickle_dir parameter, which can unblock and reconnect clients to the new service faster. We recommend setting tickle_dir to a CephFS mount shared between the two HA nodes (where the NFS Ganesha services are running).
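A minimal sketch of such a shared mount, assuming a ganesha subdirectory already exists in CephFS; the MON address and credentials are placeholders:

root # mkdir -p /tmp/ganesha
root # mount -t ceph MON_IP:6789:/ganesha /tmp/ganesha \
 -o name=admin,secretfile=/etc/ceph/admin.secret
root # mkdir -p /tmp/ganesha/tickle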

Note

Configuring the following resource is optional.

1. On earth , run the crm shell to execute the following commands to configure the resource for NFS Ganesha daemons:

root@earth # crm configure

2. Configure the block action for portblock and omit the tickle_dir option if you have not configured a shared directory:

crm(live)configure# primitive nfs-ganesha-block ocf:portblock \
 protocol=tcp portno=2049 action=block ip=192.168.2.1 op monitor depth="0" timeout="10" interval="10" tickle_dir="/tmp/ganesha/tickle/"

3. Configure the unblock action for portblock and omit the reset_local_on_unblock_stop option if you have not configured a shared directory:

crm(live)configure# primitive nfs-ganesha-unblock ocf:portblock \
 protocol=tcp portno=2049 action=unblock ip=192.168.2.1 op monitor depth="0" timeout="10" interval="10" reset_local_on_unblock_stop=true tickle_dir="/tmp/ganesha/tickle/"

4. Configure the IPaddr2 resource with portblock :

crm(live)configure# colocation ganesha-portblock inf: ganesha-ip nfs-ganesha-block nfs-ganesha-unblock
crm(live)configure# edit ganesha-ip-after-nfs-ganesha-server


order ganesha-ip-after-nfs-ganesha-server Mandatory: nfs-ganesha-block nfs-ganesha-clone ganesha-ip nfs-ganesha-unblock

5. Save your changes:

crm(live)configure# commit

6. Your configuration should look like this:

crm(live)configure# show
node 1084782956: nfs1
node 1084783048: nfs2
primitive ganesha-ip IPaddr2 \
 params ip=192.168.2.1 cidr_netmask=24 nic=eth0 \
 op monitor interval=10 timeout=20
primitive nfs-ganesha-block portblock \
 params protocol=tcp portno=2049 action=block ip=192.168.2.1 \
 tickle_dir="/tmp/ganesha/tickle/" op monitor timeout=10 interval=10 depth=0
primitive nfs-ganesha-server systemd:nfs-ganesha \
 op monitor interval=30s
primitive nfs-ganesha-unblock portblock \
 params protocol=tcp portno=2049 action=unblock ip=192.168.2.1 \
 reset_local_on_unblock_stop=true tickle_dir="/tmp/ganesha/tickle/" \
 op monitor timeout=10 interval=10 depth=0
clone nfs-ganesha-clone nfs-ganesha-server \
 meta interleave=true
location cli-prefer-ganesha-ip ganesha-ip role=Started inf: nfs1
order ganesha-ip-after-nfs-ganesha-server Mandatory: nfs-ganesha-block nfs-ganesha-clone ganesha-ip nfs-ganesha-unblock
colocation ganesha-ip-with-nfs-ganesha-server inf: ganesha-ip nfs-ganesha-clone
colocation ganesha-portblock inf: ganesha-ip nfs-ganesha-block nfs-ganesha-unblock
property cib-bootstrap-options: \
 have-watchdog=false \
 dc-version=1.1.16-6.5.1-77ea74d \
 cluster-infrastructure=corosync \
 cluster-name=hacluster \
 stonith-enabled=false \
 placement-strategy=balanced \
 last-lrm-refresh=1544793779
rsc_defaults rsc-options: \
 resource-stickiness=1 \
 migration-threshold=3
op_defaults op-options: \
 timeout=600 \
 record-pending=true

In this example /tmp/ganesha/ is the CephFS mount on both nodes (nfs1 and nfs2):

172.16.1.11:6789:/ganesha on /tmp/ganesha type ceph (rw,relatime,name=admin,secret=...hidden...,acl,wsize=16777216)

The tickle directory needs to have been created there beforehand.

12.3.5 NFS Ganesha HA and DeepSea

DeepSea does not support configuring NFS Ganesha HA. To prevent DeepSea from failing after NFS Ganesha HA is configured, exclude starting and stopping the NFS Ganesha service from DeepSea stage 4:

1. Copy /srv/salt/ceph/ganesha/default.sls to /srv/salt/ceph/ganesha/ha.sls .

2. Remove the .service entry from /srv/salt/ceph/ganesha/ha.sls so that it looks as follows:

include:
- .keyring
- .install
- .configure

3. Add the following line to /srv/pillar/ceph/stack/global.yml :

ganesha_init: ha

To prevent DeepSea from restarting the NFS Ganesha service during stage 4:

1. Copy /srv/salt/ceph/stage/ganesha/default.sls to /srv/salt/ceph/stage/ganesha/ha.sls .

2. Remove the line - ...restart.ganesha.lax from the /srv/salt/ceph/stage/ganesha/ha.sls so it looks as follows:

include:
 - .migrate
 - .core


3. Add the following line to /srv/pillar/ceph/stack/global.yml :

stage_ganesha: ha

12.4 Active-Active Configuration

This section provides an example of a simple active-active NFS Ganesha setup. The aim is to deploy two NFS Ganesha servers layered on top of the same existing CephFS. The servers will be two Ceph cluster nodes with separate addresses. The clients need to be distributed between them manually. “Failover” in this configuration means manually unmounting and remounting the other server on the client.

12.4.1 Prerequisites

For our example configuration, you need the following:

A running Ceph cluster. See Section 5.3, “Cluster Deployment” for details on deploying and configuring the Ceph cluster by using DeepSea.

At least one configured CephFS. See Chapter 11, Installation of CephFS for more details on deploying and configuring CephFS.

Two Ceph cluster nodes with NFS Ganesha deployed. See Chapter 12, Installation of NFS Ganesha for more details on deploying NFS Ganesha.

Tip: Use Dedicated Servers

Although NFS Ganesha nodes can share resources with other Ceph-related services, we recommend using dedicated servers to improve performance.

After you deploy the NFS Ganesha nodes, verify that the cluster is operational and the default CephFS pools are there:

cephadm@adm > rados lspools
cephfs_data
cephfs_metadata


12.4.2 Configure NFS Ganesha

Check that both NFS Ganesha nodes have the file /etc/ganesha/ganesha.conf installed. Add the following blocks, if they do not exist yet, to the configuration file in order to enable RADOS as the recovery backend of NFS Ganesha.

NFS_CORE_PARAM
{
 Enable_NLM = false;
 Enable_RQUOTA = false;
 Protocols = 4;
}
NFSv4
{
 RecoveryBackend = rados_cluster;
 Minor_Versions = 1,2;
}
CACHEINODE {
 Dir_Chunk = 0;
 NParts = 1;
 Cache_Size = 1;
}
RADOS_KV
{
 pool = "rados_pool";
 namespace = "pool_namespace";
 nodeid = "fqdn";
 UserId = "cephx_user_id";
 Ceph_Conf = "path_to_ceph.conf";
}

You can find out the values for rados_pool and pool_namespace by checking the already existing line in the configuration of the form:

%url rados://rados_pool/pool_namespace/...

The value of the nodeid option corresponds to the FQDN of the machine, and the values of the UserId and Ceph_Conf options can be found in the already existing RADOS_URLS block.

Because legacy versions of NFS prevent us from lifting the grace period early and therefore prolong a server restart, we disable options for NFS prior to version 4.2. We also disable most of the NFS Ganesha caching, as the Ceph libraries already do aggressive caching.

The 'rados_cluster' recovery back-end stores its info in RADOS objects. Although it is not a lot of data, we want it highly available. We use the CephFS metadata pool for this purpose, and declare a new 'ganesha' namespace in it to keep it distinct from CephFS objects.


Note: Cluster Node IDs

Most of the configuration is identical between the two hosts; however, the nodeid option in the 'RADOS_KV' block needs to be a unique string for each node. By default, NFS Ganesha sets nodeid to the host name of the node.

If you need to use different fixed values other than host names, you can for example set nodeid = 'a' on one node and nodeid = 'b' on the other one.

12.4.3 Populate the Cluster Grace Database

We need to verify that all of the nodes in the cluster know about each other. This is done via a RADOS object that is shared between the hosts. NFS Ganesha uses this object to communicate the current state with regard to a grace period.

The nfs-ganesha-rados-grace package contains a command line tool for querying and manipulating this database. If the package is not installed on at least one of the nodes, install it with

root # zypper install nfs-ganesha-rados-grace

We will use the command to create the DB and add both nodeid s. In our example, the two NFS Ganesha nodes are named ses6min1.example.com and ses6min2.example.com . On one of the NFS Ganesha hosts, run

cephadm@adm > ganesha-rados-grace -p cephfs_metadata -n ganesha add ses6min1.example.com
cephadm@adm > ganesha-rados-grace -p cephfs_metadata -n ganesha add ses6min2.example.com
cephadm@adm > ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=1 rec=0
======================================================
ses6min1.example.com E
ses6min2.example.com E

This creates the grace database and adds both 'ses6min1.example.com' and 'ses6min2.example.com' to it. The last command dumps the current state. Newly added hosts are always considered to be enforcing the grace period so they both have the 'E' flag set. The 'cur' and 'rec' values show the current and recovery epochs, which is how we keep track of what hosts are allowed to perform recovery and when.


12.4.4 Restart NFS Ganesha Services

On both NFS Ganesha nodes, restart the related services:

root # systemctl restart nfs-ganesha.service

After the services are restarted, check the grace database:

cephadm@adm > ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=3 rec=0
======================================================
ses6min1.example.com
ses6min2.example.com

Note: Cleared the 'E' Flag

Note that both nodes have cleared their 'E' flags, indicating that they are no longer enforcing the grace period and are now in normal operation mode.

12.4.5 Conclusion

After you complete all the preceding steps, you can mount the exported NFS from either of the two NFS Ganesha servers, and perform normal NFS operations against them.

Our example configuration assumes that if one of the two NFS Ganesha servers goes down, you will restart it manually within 5 minutes. After 5 minutes, the Metadata Server may cancel the session that the NFS Ganesha client held and all of the state associated with it. If the session’s capabilities get cancelled before the rest of the cluster goes into the grace period, the server’s clients may not be able to recover all of their state.

12.5 More Information

More information can be found in Book “Administration Guide”, Chapter 30 “NFS Ganesha: Export Ceph Data via NFS”.


IV Cluster Deployment on Top of SUSE CaaS Platform 4 (Technology Preview)

13 SUSE Enterprise Storage 6 on Top of SUSE CaaS Platform 4 Kubernetes Cluster


13 SUSE Enterprise Storage 6 on Top of SUSE CaaS Platform 4 Kubernetes Cluster

Warning: Technology Preview

Running a containerized Ceph cluster on SUSE CaaS Platform is a technology preview. Do not deploy it on a production Kubernetes cluster. This is not a supported version.

This chapter describes how to deploy containerized SUSE Enterprise Storage 6 on top of a SUSE CaaS Platform 4 Kubernetes cluster.

13.1 Considerations

Before you start deploying, consider the following points:

To run Ceph in Kubernetes, SUSE Enterprise Storage 6 uses an upstream project called Rook (https://rook.io/ ).

Depending on the configuration, Rook may consume all unused disks on all nodes in a Kubernetes cluster.

The setup requires privileged containers.

13.2 Prerequisites

The minimum requirements and prerequisites to deploy SUSE Enterprise Storage 6 on top of a SUSE CaaS Platform 4 Kubernetes cluster are as follows:

A running SUSE CaaS Platform 4 cluster. You need to have an account with a SUSE CaaS Platform subscription. You can activate a 60-day free evaluation here: https://www.suse.com/products/caas-platform/download/MkpwEt3Ub98~/?campaign_name=Eval:_CaaSP_4 .

At least three SUSE CaaS Platform worker nodes, with at least one additional disk attached to each worker node as storage for the OSD. We recommend four SUSE CaaS Platform worker nodes.


At least one OSD per worker node, with a minimum disk size of 5 GB.

Access to SUSE Enterprise Storage 6. You can get a trial subscription from here: https://www.suse.com/products/suse-enterprise-storage/download/ .

Access to a workstation that has access to the SUSE CaaS Platform cluster via kubectl . We recommend using the SUSE CaaS Platform master node as the workstation.

Ensure that the SUSE-Enterprise-Storage-6-Pool and SUSE-Enterprise-Storage-6-Updates repositories are configured on the management node to install the rook-k8s-yaml RPM package.

13.3 Get Rook Manifests

The Rook orchestrator uses configuration files in YAML format called manifests. The manifests you need are included in the rook-k8s-yaml RPM package. You can find this package in the SUSE Enterprise Storage 6 repository. Install it by running the following:

root # zypper install rook-k8s-yaml

13.4 Installation

Rook-Ceph includes two main components: the 'operator', which is run by Kubernetes and allows creation of Ceph clusters, and the Ceph 'cluster' itself, which is created and partially managed by the operator.

13.4.1 Configuration

13.4.1.1 Global Configuration

The manifests used in this setup install all Rook and Ceph components in the 'rook-ceph' namespace. If you need to change it, adapt all references to the namespace in the Kubernetes manifests accordingly.

Depending on which features of Rook you intend to use, alter the 'Pod Security Policy' configuration in common.yaml to limit Rook's security requirements. Follow the comments in the manifest file.


13.4.1.2 Operator Configuration

The manifest operator.yaml configures the Rook operator. Normally, you do not need to change it. Find more information in the comments in the manifest file.

13.4.1.3 Ceph Cluster Configuration

The manifest cluster.yaml is responsible for configuring the actual Ceph cluster which will run in Kubernetes. Find a detailed description of all available options in the upstream Rook documentation at https://rook.io/docs/rook/v1.0/ceph-cluster-crd.html .

By default, Rook is configured to use all nodes that are not tainted with node-role.kubernetes.io/master:NoSchedule and will obey configured placement settings (see https://rook.io/docs/rook/v1.0/ceph-cluster-crd.html#placement-configuration-settings ). The following example disables such behavior and only uses the nodes explicitly listed in the nodes section:

storage:
  useAllNodes: false
  nodes:
  - name: caasp4-worker-0
  - name: caasp4-worker-1
  - name: caasp4-worker-2

Note

By default, Rook is configured to use all free and empty disks on each node for use as Ceph storage.

13.4.1.4 Documentation

The Rook-Ceph upstream documentation at https://rook.github.io/docs/rook/master/ceph-storage.html contains more detailed information about configuring more advanced deployments. Use it as a reference for understanding the basics of Rook-Ceph before doing more advanced configurations.

Find more details about the SUSE CaaS Platform product at https://documentation.suse.com/suse-caasp/4.0/ .


13.4.2 Create the Rook Operator

Install the Rook-Ceph common components, CSI roles, and the Rook-Ceph operator by executing the following command on the SUSE CaaS Platform master node:

root # kubectl apply -f common.yaml -f operator.yaml

common.yaml will create the 'rook-ceph' namespace, Ceph Custom Resource Definitions (CRDs) (see https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ ) to make Kubernetes aware of Ceph Objects (for example, 'CephCluster'), and the RBAC roles and Pod Security Policies (see https://kubernetes.io/docs/concepts/policy/pod-security-policy/ ) which are necessary for allowing Rook to manage the cluster-specific resources.

Tip: hostNetwork and hostPorts Usage

Allowing the usage of hostNetwork is required when using hostNetwork: true in the Cluster Resource Definition. Allowing the usage of hostPorts in the PodSecurityPolicy is also required.

Verify the installation by running kubectl get pods -n rook-ceph on the SUSE CaaS Platform master node, for example:

root # kubectl get pods -n rook-ceph
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-agent-57c9j              1/1     Running   0          22h
rook-ceph-agent-b9j4x              1/1     Running   0          22h
rook-ceph-operator-cf6fb96-lhbj7   1/1     Running   0          22h
rook-discover-mb8gv                1/1     Running   0          22h
rook-discover-tztz4                1/1     Running   0          22h

13.4.3 Create the Ceph Cluster

After you modify cluster.yaml according to your needs, you can create the Ceph cluster. Run the following command on the SUSE CaaS Platform master node:

root # kubectl apply -f cluster.yaml

Watch the 'rook-ceph' namespace to see the Ceph cluster being created. You will see as many Ceph Monitors as configured in the cluster.yaml manifest (default is 3), one Ceph Manager, and as many Ceph OSDs as you have free disks.
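For example, you can follow the pods as they come up:

root # kubectl get pods --namespace rook-ceph --watch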


Tip: Temporary OSD Pods

While bootstrapping the Ceph cluster, you will see some pods with the name rook-ceph-osd-prepare-NODE-NAME run for a while and then terminate with the status 'Completed'. As their name implies, these pods provision Ceph OSDs. They are left without being deleted so that you can inspect their logs after their termination. For example:

root # kubectl get pods --namespace rook-ceph
NAME                                          READY   STATUS      RESTARTS   AGE
rook-ceph-agent-57c9j                         1/1     Running     0          22h
rook-ceph-agent-b9j4x                         1/1     Running     0          22h
rook-ceph-mgr-a-6d48564b84-k7dft              1/1     Running     0          22h
rook-ceph-mon-a-cc44b479-5qvdb                1/1     Running     0          22h
rook-ceph-mon-b-6c6565ff48-gm9wz              1/1     Running     0          22h
rook-ceph-operator-cf6fb96-lhbj7              1/1     Running     0          22h
rook-ceph-osd-0-57bf997cbd-4wspg              1/1     Running     0          22h
rook-ceph-osd-1-54cf468bf8-z8jhp              1/1     Running     0          22h
rook-ceph-osd-prepare-caasp4-worker-0-f2tmw   0/2     Completed   0          9m35s
rook-ceph-osd-prepare-caasp4-worker-1-qsfhz   0/2     Completed   0          9m33s
rook-ceph-tools-76c7d559b6-64rkw              1/1     Running     0          22h
rook-discover-mb8gv                           1/1     Running     0          22h
rook-discover-tztz4                           1/1     Running     0          22h

13.5 Using Rook as Storage for Kubernetes Workload

Rook allows you to use three different types of storage:

Object Storage

Object storage exposes an S3 API to the storage cluster for applications to put and get data. Refer to https://rook.io/docs/rook/v1.0/ceph-object.html for a detailed description.

Shared File System

A shared file system can be mounted with read/write permission from multiple pods. This is useful for applications that are clustered using a shared file system. Refer to https://rook.io/docs/rook/v1.0/ceph-filesystem.html for a detailed description.

Block Storage

Block storage allows you to mount storage to a single pod. Refer to https://rook.io/docs/rook/v1.0/ceph-block.html for a detailed description.


13.6 Uninstalling Rook

To uninstall Rook, follow these steps:

1. Delete any Kubernetes applications that are consuming Rook storage.

2. Delete all object, file, and/or block storage artifacts that you created by following Section 13.5, “Using Rook as Storage for Kubernetes Workload”.

3. Delete the Ceph cluster, operator, and related resources:

root # kubectl delete -f cluster.yaml
root # kubectl delete -f operator.yaml
root # kubectl delete -f common.yaml

4. Delete the data on hosts:

root # rm -rf /var/lib/rook

5. If necessary, wipe the disks that were used by Rook. Refer to https://rook.io/docs/rook/master/ceph-teardown.html for more details.


A Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases

Several key packages in SUSE Enterprise Storage 6 are based on the Nautilus release series of Ceph. When the Ceph project (https://github.com/ceph/ceph ) publishes new point releases in the Nautilus series, SUSE Enterprise Storage 6 is updated to ensure that the product benefits from the latest upstream bugfixes and feature backports.

This chapter contains summaries of notable changes contained in each upstream point release that has been—or is planned to be—included in the product.

Nautilus 14.2.12 Point Release

In addition to bug fixes, this major upstream release brought a number of notable changes:

The ceph df command now lists the number of PGs in each pool.

MONs now have a config option mon_osd_warn_num_repaired , 10 by default. If any OSD has repaired more than this many I/O errors in stored data, an OSD_TOO_MANY_REPAIRS health warning is generated. In order to allow clearing of the warning, a new command ceph tell osd.SERVICE_ID clear_shards_repaired COUNT has been added. By default, it will set the repair count to 0. If you want to be warned again if additional repairs are performed, you can provide a value to the command and specify the value of mon_osd_warn_num_repaired . This command will be replaced in future releases by the health mute/unmute feature.
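For example, a sketch of clearing the counter on a single OSD; the OSD ID and count are illustrative:

cephadm@adm > ceph tell osd.0 clear_shards_repaired 0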

It is now possible to specify the initial MON to contact for Ceph tools and daemons using the mon_host_override config option or --mon-host-override IP command-line switch. This generally should only be used for debugging and only affects initial communication with Ceph’s MON cluster.

Fixed an issue with osdmaps not being trimmed in a healthy cluster.


Nautilus 14.2.11 Point Release

In addition to bug fixes, this major upstream release brought a number of notable changes:

RGW: The radosgw-admin sub-commands dealing with orphans – radosgw-admin orphans find , radosgw-admin orphans finish , radosgw-admin orphans list-jobs – have been deprecated. They have not been actively maintained and they store intermediate results on the cluster, which could fill a nearly-full cluster. They have been replaced by a tool, currently considered experimental, rgw-orphan-list .

Now, when noscrub and/or nodeep-scrub flags are set globally or per pool, scheduled scrubs of the disabled type will be aborted. All user-initiated scrubs are not interrupted.

Fixed a ceph-osd crash in committed OSD maps when there is a failure to encode the first incremental map.

Nautilus 14.2.10 Point Release

This upstream release patched one security flaw:

CVE-2020-10753: rgw: sanitize newlines in s3 CORSConfiguration’s ExposeHeader

In addition to security flaws, this major upstream release brought a number of notable changes:

The pool parameter target_size_ratio , used by the PG autoscaler, has changed meaning. It is now normalized across pools, rather than specifying an absolute ratio. If you have set target size ratios on any pools, you may want to set these pools to autoscale warn mode to avoid data movement during the upgrade:

ceph osd pool set POOL_NAME pg_autoscale_mode warn

The behaviour of the -o argument to the RADOS tool has been reverted to its original behaviour of indicating an output file. This reverts it to a more consistent behaviour when compared to other tools. Specifying object size is now accomplished by using an upper case O ( -O ).

The format of MDSs in ceph fs dump has changed.


Ceph will issue a health warning if a RADOS pool’s size is set to 1 or, in other words, the pool is configured with no redundancy. This can be fixed by setting the pool size to the minimum recommended value with:

cephadm@adm > ceph osd pool set pool-name size num-replicas

The warning can be silenced with:

cephadm@adm > ceph config set global mon_warn_on_pool_no_redundancy false

RGW: bucket listing performance on sharded bucket indexes has been notably improved by heuristically – and significantly, in many cases – reducing the number of entries requested from each bucket index shard.

Nautilus 14.2.9 Point Release

This upstream release patched two security flaws:

CVE-2020-1759: Fixed nonce reuse in msgr V2 secure mode

CVE-2020-1760: Fixed XSS due to RGW GetObject header-splitting

In SES 6, these flaws were patched in Ceph version 14.2.5.389+gb0f23ac248.

Nautilus 14.2.8 Point Release

In addition to bug fixes, this major upstream release brought a number of notable changes:

The default value of bluestore_min_alloc_size_ssd has been changed to 4K to improve performance across all workloads.

The following OSD memory config options related to BlueStore cache autotuning can now be configured during runtime:

osd_memory_base (default: 768 MB)
osd_memory_cache_min (default: 128 MB)
osd_memory_expected_fragmentation (default: 0.15)
osd_memory_target (default: 4 GB)


You can set the above options by running:

cephadm@adm > ceph config set osd OPTION VALUE

The Ceph Manager now accepts profile rbd and profile rbd-read-only user capabilities. You can use these capabilities to provide users access to MGR-based RBD functionality such as rbd perf image iostat and rbd perf image iotop .

The configuration value osd_calc_pg_upmaps_max_stddev used for upmap balancing has been removed. Instead, use the Ceph Manager balancer configuration option upmap_max_deviation , which now is an integer number of PGs of deviation from the target PGs per OSD. You can set it with the following command:

cephadm@adm > ceph config set mgr mgr/balancer/upmap_max_deviation 2

The default upmap_max_deviation is 5. There are situations where CRUSH rules would not allow a pool to ever have completely balanced PGs. For example, if CRUSH requires 1 replica on each of 3 racks, but there are fewer OSDs in 1 of the racks. In those cases, the configuration value can be increased.

CephFS: multiple active Metadata Server forward scrub is now rejected. Scrub is currently only permitted on a file system with a single rank. Reduce the ranks to one via ceph fs set FS_NAME max_mds 1 .

Ceph will now issue a health warning if a RADOS pool has a pg_num value that is not a power of two. This can be fixed by adjusting the pool to an adjacent power of two:

cephadm@adm > ceph osd pool set POOL_NAME pg_num NEW_PG_NUM

Alternatively, you can silence the warning with:

cephadm@adm > ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false


Nautilus 14.2.7 Point Release

This upstream release patched two security flaws:

CVE-2020-1699: a path traversal flaw in Ceph Dashboard that could allow for potential information disclosure.

CVE-2020-1700: a flaw in the RGW beast front-end that could lead to denial of service from an unauthenticated client.

In SES 6, these flaws were patched in Ceph version 14.2.5.382+g8881d33957b.

Nautilus 14.2.6 Point Release

This release fixed a Ceph Manager bug that caused MGRs to become unresponsive on larger clusters. SES users were never exposed to the bug.

Nautilus 14.2.5 Point Release

Health warnings are now issued if daemons have recently crashed. Ceph will now issue health warnings if daemons have recently crashed. Ceph has been collecting crash reports since the initial Nautilus release, but the health alerts are new. To view new crashes (or all crashes, if you have just upgraded), run:

cephadm@adm > ceph crash ls-new

To acknowledge a particular crash (or all crashes) and silence the health warning, run:

cephadm@adm > ceph crash archive CRASH-ID
cephadm@adm > ceph crash archive-all

pg_num must be a power of two, otherwise HEALTH_WARN is reported. Ceph will now issue a health warning if a RADOS pool has a pg_num value that is not a power of two. You can fix this by adjusting the pool to a nearby power of two:

cephadm@adm > ceph osd pool set POOL-NAME pg_num NEW-PG-NUM


Alternatively, you can silence the warning with:

cephadm@adm > ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false

Pool size needs to be greater than 1, otherwise HEALTH_WARN is reported. Ceph will issue a health warning if a RADOS pool’s size is set to 1 or if the pool is configured with no redundancy. Ceph will stop issuing the warning if the pool size is set to the minimum recommended value:

cephadm@adm > ceph osd pool set POOL-NAME size NUM-REPLICAS

You can silence the warning with:

cephadm@adm > ceph config set global mon_warn_on_pool_no_redundancy false

Health warning is reported if average OSD heartbeat ping time exceeds the threshold. A health warning is now generated if the average OSD heartbeat ping time exceeds a configurable threshold for any of the intervals computed. The OSD computes 1 minute, 5 minute and 15 minute intervals with average, minimum, and maximum values.

A new configuration option, mon_warn_on_slow_ping_ratio , specifies a percentage of osd_heartbeat_grace to determine the threshold. A value of zero disables the warning. A new configuration option, mon_warn_on_slow_ping_time , specified in milliseconds, overrides the computed value and causes a warning when OSD heartbeat pings take longer than the specified amount.

A new command ceph daemon mgr.MGR-NUMBER dump_osd_network THRESHOLD lists all connections with a ping time longer than the specified threshold or value determined by the configuration options, for the average for any of the 3 intervals.

A new command ceph daemon osd.# dump_osd_network THRESHOLD will do the same as the previous one but only including heartbeats initiated by the specified OSD.

Changes in the telemetry MGR module. A new 'device' channel (enabled by default) will report anonymized hard disk and SSD health metrics to telemetry.ceph.com in order to build and improve device failure prediction algorithms.

Telemetry reports information about CephFS file systems, including:

How many MDS daemons (in total and per file system).

Which features are (or have been) enabled.


How many data pools.

Approximate file system age (year and the month of creation).

How many files, bytes, and snapshots.

How much metadata is being cached.

Other miscellaneous information:

Which Ceph release the monitors are running.

Whether msgr v1 or v2 addresses are used for the monitors.

Whether IPv4 or IPv6 addresses are used for the monitors.

Whether RADOS cache tiering is enabled (and the mode).

Whether pools are replicated or erasure coded, and which erasure code profile plug-in and parameters are in use.

How many hosts are in the cluster, and how many hosts have each type of daemon.

Whether a separate OSD cluster network is being used.

How many RBD pools and images are in the cluster, and how many pools have RBDmirroring enabled.

How many RGW daemons, zones, and zonegroups are present and which RGW frontends are in use.

Aggregate stats about the CRUSH Map, such as which algorithms are used, how bigbuckets are, how many rules are defined, and what tunables are in use.

If you had telemetry enabled before 14.2.5, you will need to re-opt-in with:

cephadm@adm > ceph telemetry on

If you are not comfortable sharing device metrics, you can disable that channel first before re-opting-in:

cephadm@adm > ceph config set mgr mgr/telemetry/channel_device false
cephadm@adm > ceph telemetry on


You can view exactly what information will be reported first with:

cephadm@adm > ceph telemetry show        # see everything
cephadm@adm > ceph telemetry show device # just the device info
cephadm@adm > ceph telemetry show basic  # basic cluster info

New OSD daemon command dump_recovery_reservations . It reveals the recovery locks held ( in_progress ) and waiting in priority queues. Usage:

cephadm@adm > ceph daemon osd.ID dump_recovery_reservations

New OSD daemon command dump_scrub_reservations . It reveals the scrub reservations that are held for local (primary) and remote (replica) PGs. Usage:

cephadm@adm > ceph daemon osd.ID dump_scrub_reservations

RGW now supports the S3 Object Lock set of APIs. RGW now supports the S3 Object Lock set of APIs, allowing for a WORM model for storing objects. Six new APIs have been added: PUT/GET bucket object lock, PUT/GET object retention, and PUT/GET object legal hold.

RGW now supports List Objects V2. RGW now supports List Objects V2 as specified at https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html .

Nautilus 14.2.4 Point Release

This point release fixes a serious regression that found its way into the 14.2.3 point release. This regression did not affect SUSE Enterprise Storage customers because we did not ship a version based on 14.2.3.

Nautilus 14.2.3 Point Release

Fixed a denial of service vulnerability where an unauthenticated client of Ceph Object Gateway could trigger a crash from an uncaught exception.

Nautilus-based librbd clients can now open images on Jewel clusters.


The Object Gateway num_rados_handles has been removed. If you were using a value of num_rados_handles greater than 1, multiply your current objecter_inflight_ops and objecter_inflight_op_bytes parameters by the old num_rados_handles to get the same throttle behavior.
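For example, if a hypothetical setup previously used num_rados_handles = 4 together with objecter_inflight_ops = 1024 and objecter_inflight_op_bytes = 104857600, the equivalent post-upgrade settings in ceph.conf would be:

objecter_inflight_ops = 4096
objecter_inflight_op_bytes = 419430400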

The secure mode of Messenger v2 protocol is no longer experimental with this release. This mode is now the preferred mode of connection for monitors.

osd_deep_scrub_large_omap_object_key_threshold has been lowered to detect an object with a large number of omap keys more easily.
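
To inspect the value seen by a running OSD, or to override it explicitly, the following commands can be used; the OSD ID is a placeholder, the daemon command must be run on the node hosting that OSD, and the override value shown is only an illustration, not a recommendation:

cephadm@adm > ceph daemon osd.ID config get osd_deep_scrub_large_omap_object_key_threshold
cephadm@adm > ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 200000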

The Ceph Dashboard now supports silencing Prometheus notifications.

Nautilus 14.2.2 Point Release

The no{up,down,in,out} related commands have been revamped. There are now two ways to set the no{up,down,in,out} flags: the old command

ceph osd [un]set FLAG

which sets cluster-wide flags; and the new command

ceph osd [un]set-group FLAGS WHO

which sets flags in batch at the granularity of any CRUSH node or device class.
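
For example (the OSD IDs and host name below are placeholders), flags can be applied to individual OSDs or to a whole CRUSH host bucket, and removed again the same way:

cephadm@adm > ceph osd set-group noup,noout osd.0 osd.1
cephadm@adm > ceph osd set-group noup,noout host1
cephadm@adm > ceph osd unset-group noup,noout osd.0 osd.1
cephadm@adm > ceph osd unset-group noup,noout host1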

radosgw-admin introduces two subcommands that allow managing expire-stale objects that might be left behind after a bucket reshard in earlier versions of Object Gateway. Expire-stale objects are expired objects that should have been automatically erased but still exist and need to be listed and removed manually. One subcommand lists such objects and the other deletes them.
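
At the time of writing, the subcommands take approximately the following form; the bucket name is a placeholder, and the exact syntax should be confirmed with radosgw-admin --help before use:

cephadm@adm > radosgw-admin objects expire-stale list --bucket BUCKET_NAME
cephadm@adm > radosgw-admin objects expire-stale rm --bucket BUCKET_NAME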

Earlier Nautilus releases (14.2.1 and 14.2.0) have an issue where deploying a single new Nautilus BlueStore OSD on an upgraded cluster (that is, one that was originally deployed pre-Nautilus) breaks the pool utilization statistics reported by ceph df . Until all OSDs have been reprovisioned or updated (via ceph-bluestore-tool repair ), the pool statistics will show values that are lower than the true value. This is resolved in 14.2.2, such that the cluster only switches to using the more accurate per-pool stats after all OSDs are 14.2.2 or later, are BlueStore, and have been updated via the repair function if they were created prior to Nautilus.
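
A minimal sketch of the repair step for a single OSD follows, assuming the default data path and that the OSD can be stopped briefly; ID is a placeholder and the commands are run on the node hosting that OSD:

cephadm@adm > systemctl stop ceph-osd@ID
cephadm@adm > ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-ID
cephadm@adm > systemctl start ceph-osd@ID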

The default value for mon_crush_min_required_version has been changed from firefly to hammer , which means the cluster will issue a health warning if your CRUSH tunables are older than Hammer. There is generally a small (but non-zero) amount of data that will be re-balanced after making the switch to Hammer tunables.

If possible, we recommend that you set the oldest allowed client to hammer or later. To display what the current oldest allowed client is, run:

cephadm@adm > ceph osd dump | grep min_compat_client

If the current value is older than hammer , run the following command to determine whether it is safe to make this change by verifying that there are no clients older than Hammer currently connected to the cluster:

cephadm@adm > ceph features
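
If the output shows no clients older than Hammer, the minimum required client release can then be raised. Only run this after the verification above:

cephadm@adm > ceph osd set-require-min-compat-client hammer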

The newer straw2 CRUSH bucket type was introduced in Hammer. If you verify that all clients are Hammer or newer, it allows new features only supported for straw2 buckets to be used, including the crush-compat mode for the Balancer (Book “Administration Guide”, Chapter 21 “Ceph Manager Modules”, Section 21.1 “Balancer”).

Find detailed information about the patch at https://download.suse.com/Download?buildid=D38A7mekBz4~

Nautilus 14.2.1 Point Release

This was the first point release following the original Nautilus release (14.2.0). The original ('General Availability' or 'GA') version of SUSE Enterprise Storage 6 was based on this point release.

Glossary

General

Admin node
The node from which you run the ceph-deploy utility to deploy Ceph on OSD nodes.

Bucket
A point that aggregates other nodes into a hierarchy of physical locations.

Important: Do Not Confuse with S3 Buckets
S3 buckets or containers are a different term, referring to folders for storing objects.

CRUSH, CRUSH Map
Controlled Replication Under Scalable Hashing: an algorithm that determines how to store and retrieve data by computing data storage locations. CRUSH requires a map of the cluster to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

Monitor node, MON
A cluster node that maintains maps of cluster state, including the monitor map and the OSD map.

Node
Any single machine or server in a Ceph cluster.

OSD
Depending on context, Object Storage Device or Object Storage Daemon. The ceph-osd daemon is the component of Ceph that is responsible for storing objects on a local file system and providing access to them over the network.

OSD node
A cluster node that stores data, handles data replication, recovery, backfilling, rebalancing, and provides some monitoring information to Ceph monitors by checking other Ceph OSD daemons.

PG
Placement Group: a sub-division of a pool, used for performance tuning.

Pool
Logical partitions for storing objects such as disk images.

Routing tree
A term given to any diagram that shows the various routes a receiver can run.

Rule Set
Rules to determine data placement for a pool.

Ceph Specific Terms

Alertmanager
A single binary which handles alerts sent by the Prometheus server and notifies the end user.

Ceph Storage Cluster
The core set of storage software which stores the user’s data. Such a set consists of Ceph monitors and OSDs.

AKA “Ceph Object Store”.

Grafana
Database analytics and monitoring solution.

Prometheus
Systems monitoring and alerting toolkit.

Object Gateway Specific Terms

archive sync module
Module that enables creating an Object Gateway zone for keeping the history of S3 object versions.

Object Gateway
The S3/Swift gateway component for Ceph Object Store.

B Documentation Updates

This chapter lists content changes for this document since the release of the latest maintenance update of SUSE Enterprise Storage 5. You can find changes related to the cluster deployment that apply to previous versions in https://documentation.suse.com/ses/5.5/single-html/ses-deployment/#ap-deploy-docupdate .

The document was updated on the following dates:

Section B.1, “Maintenance update of SUSE Enterprise Storage 6 documentation”

Section B.2, “June 2019 (Release of SUSE Enterprise Storage 6)”

B.1 Maintenance update of SUSE Enterprise Storage 6 documentation

Added a list of new features for Ceph 14.2.5 in the 'Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases' appendix.

Suggested running rpmconfigcheck to prevent losing local changes in Section 6.5, “Per-Node Upgrade Instructions” (https://jira.suse.com/browse/SES-348 ).

Added Book “Tuning Guide”, Chapter 8 “Improving Performance with LVM cache” (https://jira.suse.com/browse/SES-269 ).

Added Chapter 13, SUSE Enterprise Storage 6 on Top of SUSE CaaS Platform 4 Kubernetes Cluster (https://jira.suse.com/browse/SES-720 ).

Added a tip on monitoring cluster nodes' status during upgrade in Section 6.6, “Upgrade the Admin Node” (https://bugzilla.suse.com/show_bug.cgi?id=1154568 ).

Synchronized the network recommendations and made them more specific in Section 2.1.1, “Network Recommendations” (https://bugzilla.suse.com/show_bug.cgi?id=1156631 ).

Added Section 6.5.2, “Node Upgrade Using the SUSE Distribution Migration System” (https://bugzilla.suse.com/show_bug.cgi?id=1154438 ).

Made the upgrade chapter sequential in Chapter 6, Upgrading from Previous Releases (https://bugzilla.suse.com/show_bug.cgi?id=1144709 ).

Added changelog entry for Ceph 14.2.4 (https://bugzilla.suse.com/show_bug.cgi?id=1151881 ).

Unified the pool name 'cephfs_metadata' in examples in Chapter 12, Installation of NFS Ganesha (https://bugzilla.suse.com/show_bug.cgi?id=1148548 ).

Updated Section 5.5.2.1, “Specification” to include more realistic values (https://bugzilla.suse.com/show_bug.cgi?id=1148216 ).

Added two new repositories for 'Module-Desktop', as most customers use the GUI, in Section 6.5.1, “Manual Node Upgrade Using the Installer DVD” (https://bugzilla.suse.com/show_bug.cgi?id=1144897 ).

deepsea-cli is not a dependency of deepsea in Section 5.4, “DeepSea CLI” (https://bugzilla.suse.com/show_bug.cgi?id=1143602 ).

Added a hint to migrate ntpd to chronyd in Section 6.2.9, “Migrate from ntpd to chronyd” (https://bugzilla.suse.com/show_bug.cgi?id=1135185 ).

Added Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.16 “Deactivating Tuned Profiles” (https://bugzilla.suse.com/show_bug.cgi?id=1130430 ).

Added a note to consider migrating a whole OSD node in Section 6.13.3, “OSD Deployment” (https://bugzilla.suse.com/show_bug.cgi?id=1138691 ).

Added a point about migrating MDS names in Section 6.2.6, “Verify MDS Names” (https://bugzilla.suse.com/show_bug.cgi?id=1138804 ).

B.2 June 2019 (Release of SUSE Enterprise Storage 6)

GENERAL UPDATES

Added Section 5.5.2, “DriveGroups” (jsc#SES-548).

Rewrote Chapter 6, Upgrading from Previous Releases (jsc#SES-88).

Added Section 7.2.1, “Enabling IPv6 for Ceph Cluster Deployment” (jsc#SES-409).

Made BlueStore the default storage back-end (Fate#325658).

Removed all references to external online documentation, replaced with the relevant content (Fate#320121).

BUGFIXES

Added information about AppArmor during upgrade in Section 6.2.5, “Adjust AppArmor” (https://bugzilla.suse.com/show_bug.cgi?id=1137945 ).

Added information on co-location of Ceph services on High Availability setups in Section 12.3, “High Availability Active-Passive Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=1136871 ).

Added a tip about orphaned packages in Section 6.5, “Per-Node Upgrade Instructions” (https://bugzilla.suse.com/show_bug.cgi?id=1136624 ).

Replaced profile-* with role-storage in Tip: Deploying Monitor Nodes without Defining OSD Profiles (https://bugzilla.suse.com/show_bug.cgi?id=1138181 ).

Added Section 6.13, “Migration from Profile-based Deployments to DriveGroups” (https://bugzilla.suse.com/show_bug.cgi?id=1135340 ).


Added Section 6.8, “Upgrade Metadata Servers” (https://bugzilla.suse.com/show_bug.cgi?id=1135064 ).

Noted that the MDS cluster needs to be shrunk in Section 6.8, “Upgrade Metadata Servers” (https://bugzilla.suse.com/show_bug.cgi?id=1134826 ).

Changed the configuration file to /srv/pillar/ceph/stack/global.yml (https://bugzilla.suse.com/show_bug.cgi?id=1129191 ).

Updated various parts of Book “Administration Guide”, Chapter 29 “Exporting Ceph Data via Samba” (https://bugzilla.suse.com/show_bug.cgi?id=1101478 ).

master_minion.sls is gone in Section 5.3, “Cluster Deployment” (https://bugzilla.suse.com/show_bug.cgi?id=1090921 ).

Mentioned the deepsea-cli package in Section 5.4, “DeepSea CLI” (https://bugzilla.suse.com/show_bug.cgi?id=1087454 ).
