Appendix D

Configuring Resiliency and Resiliency Groups

To help protect against loss of assets due to drive, node, or network failure, COS Release 3.22.1 supports data resiliency:

• At the node level, enabling recovery of data lost due to drive failure within a node.

• At the cluster level, enabling recovery of lost data due to node failure within a cluster.

In addition, COS supports bonding of two management interface ports to provide redundant connectivity in the event of a network interface card (NIC) failure.

This section describes these features and provides instructions for configuring them through the COS setup file or (for data resiliency only) through the V2PC GUI.

Configuring Resiliency

COS provides resiliency at the node and cluster level using one of two methods: mirroring or erasure coding. Resiliency at the node level is achieved using either local mirroring (LM) or local erasure coding (LEC), which is the default. Similarly, resiliency at the cluster level is achieved using either remote mirroring (RM), which is the default, or distributed erasure coding (DEC). Content resiliency policies are set per COS cluster, and all nodes within a cluster share the same resiliency patterns.

Note We recommend using either remote mirroring or DEC, but not both. COS 3.20.1 does not support migration from one scheme to another while preserving stored content. COS treats remote mirroring and DEC as mutually exclusive options. DEC is preferred, so if both are enabled, COS uses DEC.

About Mirroring

Mirroring configures the disks in a node, or the nodes in a cluster, to hold a specified number of exact copies of the object data. The number of copies can be set from 1 to 4, with 1 representing the original object data only, and 4 representing the object data plus three copies. Thus, for example, a value of 2 specifies one copy plus the original object data.

To configure mirroring in the GUI, you specify the number of desired copies differently for local and remote mirroring:

• For local mirroring, the number you specify includes the original plus the number of local copies. For example, setting the number to 3 in the GUI specifies the original and two local copies.


• For remote mirroring, the number you specify includes only the number of remote copies. For example, setting the number to 3 in the GUI specifies three remote copies in addition to the (local) original.

Note For clusters of two or more nodes, always use DEC instead of remote mirroring.
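To make the difference in GUI semantics concrete, the following is a minimal sketch (Python, illustrative only, not part of COS) that maps a GUI copy-count value to the total number of stored instances under each scheme, based on the descriptions above.

def total_instances(gui_value, scheme):
    """Return the total stored instances (original + copies) implied by a GUI value.

    Based on the descriptions above:
      - local mirroring:  the GUI value already includes the original
      - remote mirroring: the GUI value counts only the remote copies
    """
    if scheme == "local":
        return gui_value            # e.g., 3 -> original plus two local copies
    if scheme == "remote":
        return gui_value + 1        # e.g., 3 -> original plus three remote copies
    raise ValueError("scheme must be 'local' or 'remote'")

print(total_instances(3, "local"))   # 3
print(total_instances(3, "remote"))  # 4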

About Erasure Coding

Erasure coding, or software RAID, is a method of data protection in which cluster data is redundantly encoded, divided into blocks, and distributed or striped across different locations or storage media. The goal is to enable any data lost due to a drive or node failure in the cluster to be reconstructed using data stored in a part of the cluster that was not affected by the failure.

COS 3.20.1 supports both local erasure coding (LEC), in which data is striped across the disks in a node, and distributed erasure coding (DEC), in which data is striped across nodes in a cluster.

For both LEC and DEC, data recovery is performed as a low-priority background task to avoid possible impact on performance. Even so, erasure coding provides faster data reconstruction than hardware RAID. Speed is important in maintaining resiliency, because a failed node cannot help to recover other failed nodes until it has fully recovered.

Defining Resiliency

Two key factors define the degree of resiliency of a given cluster configuration:

• Resiliency ratio (RR) functionally divides the disks or nodes in a cluster into data blocks and parity blocks. The RR of a cluster indicates how much of its total storage is devoted to protection against data loss. Conventionally, the RR is expressed as an N:M value, where N represents the number of data blocks and M is the number of parity blocks.

• Resiliency factor (RF) is the number of data (N) blocks that can fail at one time before losing the ability to fully recover any lost data. The RF is equivalent to the parity (M) value in the RR for the cluster, and can be selected in the GUI as described in About Mirroring, page D-1.

For LEC, COS assigns a default RR of 12:2, but you can choose another RR up to 18:4 maximum. For DEC, the cos-aic-client calculates the RR for a chosen cluster size and RF, but you can choose another RR up to 18:18 maximum. As you consider whether to use the defaults or assign new values, you must weigh a number of factors specific to your deployment, such as:

• The amount of storage that can be devoted to resiliency

• The degree of redundancy (RF) that must be achieved

• How quickly failed data blocks can be recovered

The following examples illustrate how these factors can be used to determine the best resiliency scheme for a particular COS node or cluster.

Example 1: LEC at Node Level

To see how the LEC resiliency ratio affects individual node behavior, consider a cluster in which each node has 4 TB disks, runs at 80% full, and can maintain a data read rate of 20 Gbps (2.5 GB/s).

If each node in the cluster has an assigned LEC RR of 10:1:


• Each node can recover from a single disk failure on its own, but if more than one concurrent disk failure occurs, DEC must be used to rebuild the data.

• 1 MB of parity data is created for every 10 MB of data written to each node, representing a storage overhead of 1/10 or 10%. Also, in the event of a lost disk, the rebuild logic must read in 10 MB of data for every 1 MB recovered, representing a rebuild overhead of 10:1.

• If a node loses a single disk when running at 80% full, then (4 TB x 0.8 =) 3.2 TB of data must be rebuilt. At a 20 Gbps data read rate and a 10:1 rebuild overhead, the data rebuild rate is (20 Gbps / 10 =) 2 Gb (256 MB) per second, or roughly 0.9 TB per hour. This means that 3.2 TB of data can be rebuilt in under 4 hours.

By comparison, if the same cluster were assigned the default LEC RR of 12:2 instead of 10:1:

• Each node can recover from two concurrent disk failures on its own, before using DEC.

• Each node stores 2 MB of parity for every 12 MB of data, representing a 17% storage overhead and a 12:1 rebuild overhead.

• At a data rebuild rate of (20 Gbps / 12 =) 0.75 TB per hour, recovery takes about 4.3 hours for one lost disk (3.2 TB), or 8.6 hours for two lost disks.

Note • Actual data rebuild times are highly dependent on the capacity of the hardware components in the server. Servers that can deliver significantly more throughput reduce the rebuild time, while servers with less capacity increase the rebuild time. LEC uses local disk reads when recovering data, so the speed of the disk channel directly impacts recovery time.

• Rebuild activity runs at a lower priority than regular object reads and writes, in an effort to maintain normal performance during rebuild periods. However, because normal reads and writes take priority over repair activity, a busy server takes longer to rebuild data from a lost disk than an idle server, which can devote more bandwidth to repair.

• An exception to the lower priority for rebuild activity is made for on-demand repair work. This occurs when a client requests data that is waiting to be rebuilt due to a disk failure. In this case, the requested data is rebuilt immediately so that the repair work is transparent to the clients.
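The arithmetic in Example 1 can be generalized. The following is a minimal sketch (Python, illustrative only, not part of COS) that reproduces the rebuild-time estimates above from the disk size, fill level, node read rate, and LEC data-block count.

def lec_rebuild_hours(disk_tb, fill, read_gbps, data_blocks, lost_disks=1):
    """Estimate LEC rebuild time for lost disks on one node.

    The rebuild rate is limited to read_gbps / data_blocks because N data
    blocks must be read for every block rebuilt (see Example 1 above).
    """
    data_to_rebuild_tb = disk_tb * fill * lost_disks
    rebuild_gbps = read_gbps / data_blocks                 # e.g., 20 / 10 = 2 Gbps
    rebuild_tb_per_hour = rebuild_gbps / 8 * 3600 / 1000   # Gbps -> TB per hour
    return data_to_rebuild_tb / rebuild_tb_per_hour

print(round(lec_rebuild_hours(4, 0.8, 20, 10), 1))  # ~3.6 hours for one lost disk, LEC 10:1
print(round(lec_rebuild_hours(4, 0.8, 20, 12), 1))  # ~4.3 hours for one lost disk, LEC 12:2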

Example 2: DEC at Cluster Level

To see how the DEC resiliency ratio affects cluster configuration and behavior, consider a cluster of 11 nodes, with each node running 80% full, maintaining a data read rate of 16 Gbps (2 GB/s), and having a total storage capacity of 100 TB.

If the cluster has an assigned DEC RR of 8:2:

• The cluster requires a minimum of 10 active nodes (8 data plus 2 parity) to continue writing data at the assigned 8:2 resiliency ratio. With 11 nodes, this cluster can lose one node and still continue to provide the same resiliency for newly added objects.

• The cluster can recover from two concurrent node failures, but to do so, must have enough free disk space to hold the recovered data for two nodes. For an 11-node cluster, this means that 2/11, or just under 20%, of the total disk space must be kept free. This can be reduced to 10% if it is safe to assume that at least one replacement node can be brought online.

• 2 MB of DEC parity data is created for every 8 MB of data written to the cluster, representing a storage overhead of 2/8 or 25%. This compares favorably to mirroring, whose storage overhead is four times greater (100%) but only enables recovery from one node failure, rather than two. Also, in the event of a lost node, the rebuild logic must read in 8 MB of data for every 1 MB recovered, representing a rebuild overhead of 8:1.


• If one node is lost when running at 80% full, then (100 x 0.8 =) 80 TB of data must be rebuilt by the 10 remaining nodes in the cluster. Dividing the task equally among the remaining 10 servers, each must rebuild 8 TB of data. At a 16 Gbps data read rate and an 8:1 rebuild overhead, the data rebuild rate is (16 Gbps / 8 =) 2 Gb (256 MB) per second, or 0.9 TB per hour. At this rate, each node can rebuild its 8 TB share of the total lost data in 8.9 hours.

By comparison, if the same cluster were assigned a DEC RR of 8:1 instead of 8:2:

• The cluster can fully recover data from only a single failed node, but requires only half the free disk space (10%) of the 8:2 RR case to support this recovery.

• Each node stores 1 MB of parity for every 8 MB of data, representing a 12.5% storage overhead and an 8:1 rebuild overhead.

• Because the rebuild rate is determined by the N value (8), it is the same for 8:1 as for 8:2. At a data rebuild rate of (16 Gbps / 8 =) 0.9 TB per hour, recovery of one lost node (8 TB) takes 8.9 hours.

Note • Actual data rebuild times are directly affected by the number of nodes in the cluster. If there are fewer nodes in the cluster, the work is divided among fewer servers, resulting in longer rebuild times. DEC uses remote data reads across the network when recovering data, so the speed of the network channel directly impacts the recovery time.

• Actual data rebuild times are highly dependent on the capacity of the hardware components in the server. Servers that can deliver significantly more throughput reduce the rebuild time, while servers with less capacity increase the rebuild time.

• Rebuild activity runs at a lower priority than regular object reads and writes, in an effort to maintain normal performance during rebuild periods. However, because normal reads and writes take priority over repair activity, a busy cluster takes longer to rebuild data from a lost node than an idle cluster, which can devote more bandwidth to repair.

• An exception to the lower priority for rebuild activity is made for on-demand repair work. This occurs when a client requests data that is waiting to be rebuilt due to a disk failure. In this case, the requested data is rebuilt immediately so that the repair work is transparent to the clients.
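The same kind of back-of-the-envelope estimate can be made for DEC. The sketch below (Python, illustrative only, not part of COS) reproduces the free-space and rebuild-time figures from Example 2; the node capacity, fill level, and read rate are the example's assumptions, not fixed product values.

def dec_estimates(nodes, node_capacity_tb, fill, read_gbps, n_data, m_parity):
    """Rough DEC sizing figures for the loss of one node (see Example 2 above)."""
    free_fraction = m_parity / nodes                  # space kept free to absorb M node losses
    lost_tb = node_capacity_tb * fill                 # data to rebuild for one lost node
    share_tb = lost_tb / (nodes - 1)                  # each surviving node rebuilds its share
    rebuild_tb_per_hour = (read_gbps / n_data) / 8 * 3600 / 1000
    return free_fraction, share_tb / rebuild_tb_per_hour

free, hours = dec_estimates(11, 100, 0.8, 16, 8, 2)
print(round(free * 100), "% free space;", round(hours, 1), "hours per node")  # ~18 %; ~8.9 hours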

Example 3: Using LEC and DEC Together

To see how LEC and DEC work together, consider the cluster configuration described in the previous examples, but with a LEC RR of 10:1 for each node and a DEC RR of 8:2 for the cluster as a whole.

With this configuration:

• The cluster can recover from two concurrent node failures as well as a single disk failure per node at the same time. A LEC RR of 10:1 means that a node must lose at least two disks concurrently before DEC is required to recover the lost data.

• Because the DEC parity data is distributed across the nodes in the cluster, it is further protected by LEC at the node level. This allows recovery from disk failure to be performed locally, including the DEC parity data stored on the node.

• The DEC RR of 8:2 means that every 10 MB block of data includes 8 MB for content storage and 2 MB for DEC parity. In addition, the LEC RR of 10:1 means that every 10 MB block of (content + DEC parity) data requires another 1 MB of parity data. So in total, every 8 MB of content storage requires (2 DEC + 1 LEC =) 3 MB of parity data, giving a total parity overhead of (3 / 8 =) 37.5%.

For any cluster configuration using both DEC and LEC, the total parity overhead is found by first applying the DEC overhead and then applying the LEC overhead, as follows:

Total Parity Overhead = (M1/N1) + ((1 + M1/N1) X (M2/N2))


where N1:M1 represents DEC parity and N2:M2 represents LEC parity.

Using the values from this example to illustrate:

Total Parity Overhead = (2/8) + ((1 + 2/8) X (1/10)) = 37.5%

Note The parity calculations just described do not account for the additional free disk space needed for storage of recovered data. Be sure to include this requirement in your total overhead calculations. For example, a DEC RR of 8:2 requires free disk space for up to two failed nodes.
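As an illustration, the formula can be expressed directly in code. This is a minimal sketch (Python, not part of COS) that reproduces the 37.5% figure from the example above.

def total_parity_overhead(n1, m1, n2, m2):
    """Combined DEC (N1:M1) and LEC (N2:M2) parity overhead, per the formula above."""
    dec = m1 / n1
    return dec + (1 + dec) * (m2 / n2)

print(round(total_parity_overhead(8, 2, 10, 1) * 100, 1))  # 37.5 (percent)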

Configuring Resiliency Using the V2PC GUI

To configure resiliency through the V2PC GUI, open the GUI as described in Accessing the V2PC GUI, page A-2, and navigate to Cisco Cloud Object Store (COS) > COS Clusters.

Figure D-1 V2PC GUI, COS Clusters Page

Locate the cluster to be updated and click its Edit icon to enable it for editing. Choose the desired resiliency policy from the Asset Redundancy Policy drop-down list for the endpoint.

Note • If the desired policy does not appear, choose Service Domain Objects > Asset Redundancy Policies to confirm the policy exists. If not, click Add Row and create a new policy. Enter a profile name and expected cluster size, select a model (appropriate interfaces and names auto-populate), assign the profile to a cluster, configure a DNS server, and assign IP pools to each interface.

• The GUI lets you enter an expected cluster size when creating the COS node initialization profile. For DEC, the cos-aic-client then uses this cluster size to calculate a suitable N value given the expected cluster size and the selected resiliency factor M. Both values can still be overridden, as described in Finding N:M Values, page D-8, by creating or editing the COS file /arroyo/test/aftersetupfile to hold them.


Caution When manually setting N:M (or any other) values that must persist, be sure to use aftersetupfile and not setupfile. The settings in setupfile can be overwritten by changes made via the GUI. Also, you must configure the aftersetupfile before registering the new node to V2PC. Otherwise, DEC settings within the cluster will be inconsistent, requiring at least one service-disrupting reboot to correct.

Configuring Local Mirroring Manually

To configure local mirroring on a node manually:

Step 1 Open (or if not present, create) the COS file /arroyo/test/aftersetupfile for editing.

Step 2 Include the line vault local copy count in the file and set the value to 2, 3, or 4 as appropriate.

Note Setting the value to 1 simply maintains the original data and creates no additional copies.

Step 3 Disable local erasure coding by setting allow vault raid to 0 (or simply omit or remove this line).

Example

# CServer core configuration. Changes to this file require a server reboot.
serverid 1
groupid 3333
arrayid 6666
. . . .
allow vault raid 0
vault local copy count 2
vault mirror copies 2
allow server raid 0
allow tcp traffic 1
. . . .
er_enable 0
rtp_enable 0

Configuring Local Erasure Coding Manually

To enable local erasure coding manually:

Step 1 Open (or if not present, create) the COS file /arroyo/test/aftersetupfile for editing.

Step 2 Set allow vault raid to 1 to enable LEC.

Step 3 Disable local mirroring by setting vault local copy count to 1 (or simply omit or remove this line).

Example

# CServer core configuration. Changes to this file require a server reboot.
serverid 1
groupid 3333
arrayid 6666
. . . .
allow vault raid 1
vault local copy count 1
vault mirror copies 2
allow server raid 0
allow tcp traffic 1
. . . .
er_enable 0
rtp_enable 0

Migrating from LM to LEC Manually

To migrate a service endpoint from local mirroring to local erasure coding:

Step 1 Temporarily leave local mirroring enabled for the service endpoint.

Step 2 Enable LEC for the service endpoint and let it establish the parity needed for each data object.

Step 3 When parity is established, disable local mirroring.

Configuring Remote Mirroring Manually

Note For clusters of two or more nodes, always use DEC instead of remote mirroring.

To enable and configure remote mirroring manually:

Step 1 Open (or if not present, create) the COS file /arroyo/test/aftersetupfile for editing.

Step 2 Set vault mirror copies to the value 2, 3, or 4 as appropriate to enable remote mirroring. The value you enter specifies the object data plus the number of exact copies desired.

Note Setting the value to 1 simply maintains the original data and creates no additional copies.

Step 3 Disable distributed erasure coding by setting allow server raid to 0 (or simply omit or remove this line).

Example

# CServer core configuration. Changes to this file require a server reboot.
serverid 1
groupid 3333
arrayid 6666
. . . .
allow vault raid 0
vault local copy count 2
vault mirror copies 2
allow server raid 0
allow tcp traffic 1
. . . .
er_enable 0
rtp_enable 0

Configuring Distributed Erasure Coding Manually

To enable and configure distributed erasure coding manually:

Step 1 Open (or if not present, create) the COS file /arroyo/test/aftersetupfile for editing.

Step 2 Set allow server raid to 1 and add the following lines immediately below:

• target server raid data blocks <value>

This controls the number of data blocks used. The default <value> is 8, and the valid range is 1-18.

• target server raid parity blocks <value>

This controls the number of parity blocks used. The default <value> is 1, and the valid range is 1-18.

Note See Finding N:M Values, page D-8 to determine appropriate data block and parity block values.

Step 3 Disable remote mirroring by setting vault mirror copies to 0 (or simply omit or remove this line).

Example

# CServer core configuration. Changes to this file require a server reboot.
serverid 1
groupid 3333
arrayid 6666
. . . .
allow vault raid 0
vault local copy count 2
vault mirror copies 1
allow server raid 1
target server raid data blocks 8
target server raid parity blocks 1
allow tcp traffic 1
. . . .
er_enable 0
rtp_enable 0

Finding N:M Values

To configure DEC, you must specify the number of data blocks (N) and parity blocks (M) used for data encoding. Table D-1 shows the corresponding data-to-parity block (N:M) values for a given number of nodes in a cluster and for a given degree of resiliency desired for the cluster. For details, see Defining Resiliency, page D-2.

Note COS does not currently support configuration of new N:M (data:parity) block values through the V2PC GUI. If you need to configure new N:M values, you must do so in the aftersetupfile.


In this table:

• Nodes is the number of nodes in the cluster.

• RF is the desired resiliency factor, or number of nodes that can fail without data loss.

• Min is the minimum number of nodes required to achieve a given resiliency factor.

The ratios appearing in the cells of the table are N:M values, where N is the number of data blocks and M is the number of parity blocks needed to achieve the desired resiliency factor for a given node count.

To use the table to find the N:M values for a cluster:

Step 1 In the Nodes column, locate the row corresponding to the number of nodes in the cluster.

Note For COS 3.20.1, you must select the N:M configuration based upon the initial nodes in the cluster. COS does not currently support adding nodes to a cluster after DEC is configured for the cluster.

Step 2 Locate the column in the table whose header represents the desired RF value for the cluster.

Step 3 Find the corresponding N:M value at the intersection of the row and column just located.

Step 4 Configure DEC using N as the number of data blocks and M as the number of parity blocks.
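If you script this selection, the table entries can be captured as a simple lookup. The following is a minimal sketch (Python, illustrative only, not part of COS) that uses a few entries transcribed from Table D-1 below; extend it with the rows relevant to your deployment.

# (nodes, RF) -> "N:M", a few entries transcribed from Table D-1
NM_TABLE = {
    (5, 2): "2:2",
    (8, 1): "6:1",
    (11, 2): "8:2",
    (20, 2): "13:2",
}

def find_nm(nodes, rf):
    """Return the N:M string for the given node count and resiliency factor."""
    return NM_TABLE[(nodes, rf)]

print(find_nm(11, 2))  # 8:2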

Table D-1 lists the possible N:M values for DEC for 1-20 nodes and a resiliency factor (RF) of 0-4.

Caution Table D-1 includes two resiliency factors, RF = 3 and RF = 4, which are supported in COS 3.8.1 but not in COS 3.20.1, due to a change in the method used to calculate resiliency requirements for metadata. To ensure the desired resiliency for both objects and metadata with COS 3.20.1, use RF = 0, 1, or 2.

Table D-1 Possible N:M Values for DEC

Nodes RF = 0 RF = 1 RF = 2 RF = 3 * RF = 4 *

1 1:0 — — — —

2 1:0 — — — —

3 1:0 1:1 — — —

4 1:0 2:1 — — —

5 1:0 3:1 2:2 — —

6 1:0 4:1 3:2 — —

7 1:0 5:1 4:2 3:3 —

8 1:0 6:1 5:2 4:3 —

9 1:0 7:1 6:2 5:3 4:4

10 1:0 8:1 7:2 6:3 5:4

11 1:0 8:1 8:2 7:3 6:4

12 1:0 9:1 9:2 8:3 7:4

13 1:0 9:1 9:2 9:3 8:4

14 1:0 10:1 10:2 10:3 9:4

15 1:0 10:1 10:2 11:3 10:4

16 1:0 11:1 11:2 12:3 11:4

17 1:0 11:1 11:2 12:3 12:4

18 1:0 12:1 12:2 13:3 13:4

19 1:0 12:1 12:2 13:3 14:4

20 1:0 12:1 13:2 14:3 15:4

* These RF options are not supported in COS Release 3.20.1.

Table D-2 shows the total parity overhead (as described in Defining Resiliency, page D-2) for a given number of nodes and resiliency factor for each of two LEC values, 12:1 and 12:2.

Table D-2 Total Parity Overhead for LEC 12:1 and LEC 12:2

Nodes  RF=0 (12:1)  RF=0 (12:2)  RF=1 (12:1)  RF=1 (12:2)  RF=2 (12:1)  RF=2 (12:2)  RF=3 (12:1)  RF=3 (12:2)  RF=4 (12:1)  RF=4 (12:2)

1 8% 17% — — — — — — — —

2 8% 17% — — — — — — — —

3 8% 17% 117% 134% — — — — — —

4 8% 17% 62% 76% — — — — — —

5 8% 17% 44% 56% 117% 134% — — — —

6 8% 17% 35% 46% 80% 95% — — — —

7 8% 17% 30% 41% 62% 76% 117% 134% — —

8 8% 17% 26% 37% 52% 64% 90% 105% — —

9 8% 17% 23% 34% 44% 56% 73% 87% 117% 134%

10 8% 17% 22% 32% 39% 51% 62% 76% 95% 111%

11 8% 17% 22% 32% 35% 46% 55% 67% 80% 95%

12 8% 17% 20% 30% 32% 43% 49% 61% 70% 84%

13 8% 17% 20% 30% 32% 43% 44% 56% 62% 76%

14 8% 17% 19% 29% 30% 41% 41% 52% 56% 69%

15 8% 17% 19% 29% 30% 41% 38% 49% 52% 64%

16 8% 17% 18% 28% 28% 38% 35% 46% 48% 60%

17 8% 17% 18% 28% 28% 38% 35% 46% 44% 56%

18 8% 17% 17% 27% 26% 37% 33% 44% 42% 53%

19 8% 17% 17% 27% 26% 37% 33% 44% 39% 51%

20 8% 17% 17% 27% 25% 35% 31% 42% 37% 48%


Replicating Objects During Swift Write Operations

While an object is being created or modified using Swift write operations, copies of the object data can be stored in real time on the local COS node and its peer nodes in the COS cluster. This functionality works only if the RAID feature on the node is disabled.

To disable the RAID feature on the node, open /arroyo/test/setupfile and set allow vault raid to 0.

To replicate object data on the node, open /arroyo/test/setupfile and set vault local copy count to a value greater than 1. This value specifies how many copies of the object data are to be stored on the node.

To replicate object data on the peer nodes, open /arroyo/test/setupfile and set vault mirror copies to a value greater than 1. This value specifies how many remote copies of the object data are maintained.
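For reference, a hypothetical setupfile fragment combining these three settings (illustrative values only, in the same format as the examples earlier in this appendix) might include:

allow vault raid 0
vault local copy count 2
vault mirror copies 2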

Configuring Resiliency Groups

COS Release 3.20.1 supports dividing the COS nodes in a cluster into one or more resiliency groups. A resiliency group is a group of servers that work together to support DEC by striping and distributing data internally within the group. Using resiliency groups to manage DEC reduces communication overhead for both the cluster and the site in general.

When using resiliency groups, every node in a COS cluster is defined as a member of a resiliency group. This logical subdivision of the cluster does not, however, interfere with the ability of the cluster to store a file object wherever it makes the most sense based on the resiliency configuration of the cluster and the capacity of the nodes in the cluster.

Once defined, resiliency groups are managed by software and are transparent to the user.

Related Terminology

To briefly review the components of a COS installation:

• A node is a single COS server. Every node has a server ID that is unique within its cluster.

• A site consists of one or more COS servers that have the same array ID. A site is a user-defined subdivision of a cluster, and may (but need not) correspond to a geographic location.

• A cluster contains one or more COS servers with the same namespace and group ID. A cluster must have at least one COS server and exactly one associated Cassandra database cluster, and may also contain multiple sites. A cluster has a target resiliency that can be met using DEC or mirroring.

Design Considerations

• Within a specific resiliency group, all server hardware should be of the same type.

• COS Release 3.20.1 does not support downgrade if more than one resiliency group is used.

• A resiliency group can contain a maximum of 20 COS nodes.

• When expanding, recommended practice is to fill one resiliency group before adding nodes to another resiliency group.


• The number of servers per resiliency group must support the DEC pattern. For example, if the DEC pattern is 2:1, there must be at least three servers in the resiliency group (and four would be ideal with DEC 2:1).

Defining Resiliency Groups

Defining a new resiliency group simply requires assigning an available resiliency group ID to each of the COS nodes to be included in the group. COS supports defining resiliency groups either via the V2PC GUI or via the setup file (use setupfile with MOS or VMP, and aftersetupfile with V2PC).

Note While it is possible to fully configure COS from the CLI, deployments typically use V2PC COS node profiles, in which case V2PC populates the data for the setup file through the COS node profile.

Using the V2PC GUI

To define resiliency groups using the V2PC GUI:

Step 1 Log in to the V2PC GUI and, from the navigation menu, choose Cisco Cloud Object Storage (COS) > Node Profiles.

Step 2 Locate a COS node to be included in the resiliency group and open its node profile for editing.

Step 3 Assign an available Resiliency Group ID to the node (valid options are 1-200) and save the changes.

Step 4 Repeat Steps 2-3 for all nodes to be assigned to the same resiliency group ID.

Using the Setup File

To define resiliency groups using the setup file:

Step 1 Open (or if not present, create) one of the following files:

• /arroyo/test/aftersetupfile if using V2PC

• /arroyo/test/setupfile if using MOS or VMP

Step 2 Add the following line to the file:

resilience group ID <value>

where <value> is a number between 1 and 200.

Example

[root@cos18 ~]# cat /arroyo/test/setupfile
# CServer core configuration. Changes to this file require
# a server reboot.
serverid 2
groupid 3333
arrayid 6666
streamer 0 vault 1 cache 0
bypass_isacheck 1
cos 1
management bond0
e1000 adapters: maxrate 975
igb adapters: maxrate 975
ixgbe adapters: maxrate 9850
e1000 2: streaming 1 fill 1 ip 10.93.232.19 tport 0 cport 0 tgid 0
allow tcp traffic 1
disks 1
model CDE465-4R4
resilience group ID 1
vault local copy count 1
allow vault raid 0
vault mirror copies 1
allow server raid 0

Note The line allow server raid 0 must be set or cserver will not load.

Step 3 Save the changes to the file.

Resiliency Group Assignment

At boot-up, the local COS node goes through all of the servers in the RemoteServers file and queries each one for its Resilience Group ID. This request also tells each remote server the Resilience Group ID of the local server.

Read Requests

When a server receives a read request, it looks up the object in Cassandra and locates the corresponding Resilience Group ID(s). If the local Resilience Group is in the list of IDs, the read continues the same as it does in the current version of COS.

If the first Resilience Group in the list does not actually have the desired data available, the server checks the other Resilience Groups one at a time until the desired data is located.

If a different Resilience Group contains the desired object, the server checks whether the client supports being redirected to the host Resilience Group and, if so, redirects the request.

If read redirection is not supported:

1. Proxy read the information for the client.

2. Send a locate request to the servers in the Resilience Group the node is connected to.

3. Get back the location of the data stripes and their host server IP addresses.

4. Issue HTTP read requests to those servers to read the corresponding stripe data.

5. Once the stripe data is received, forward it to the client.


Write Requests

When a server receives a write request, it checks whether the current Resilience Group's capacity is above or below average. If the group is above average for capacity (by more than 10%), it chooses another Resilience Group better suited to handle the write and forwards the write request to a server in that Resilience Group.

If the client supports redirect, then the client write request can be redirected to the other Resilience Group.

If redirect isn’t supported, then that node will act as a proxy and forward the write request to the other Resilience Group.

Once a Resilience Group accepts a write request, it proceeds the same as in COS 3.12. When the goid value is written to Cassandra, the Resilience Group ID value will also be written so that the goid can be found in the future.

Resilience Group Smoothing

The Resilience Group Smoother (RGSmoother) is a background low-priority evaluator that monitors the relative disk space in each Resilience Group (RG) and moves content from one RG to another in an effort to balance the distribution of content in a cluster.

The ingest logic attempts to balance the placement of content across the RGs in a cluster in a manner that keeps the disk storage balanced.

However, if the cluster should become unbalanced (for example, due to more deletes occurring in one RG than the average), the RGSmoother will detect the imbalance and reposition content until each RG's disk usage is within a configured tolerance of the average disk usage of all RGs.

Evaluator

The RGSmoother is one of the lowest priority evaluators and defers resources to any higher priority operations, such as normal traffic and data recovery efforts. The logic provides mechanisms to periodically analyze the RGs in the cluster to determine if RG smoothing is needed, as well as mechanisms to walk through all objects looking for smoothing candidates.

RG Smooth logic

Smooth Logic determines when smoothing is needed and what objects should be repositioned. This is similar to the Remote Smoothing Logic within a Resilience Group.

RG Smoothing is driven by the Resilience Groups that have more than their fair share of content. Each RG within the cluster periodically computes the average available RG disk storage and compares the local RG to this average. RG Smoothing is triggered to occur in the following conditions:

1. The local RG's available disk storage is less than the average available disk storage minus the configured tolerance and at least one RG has more than the average available disk storage.

2. The local RG's available disk storage is less than or equal to the average available disk storage, no other RG has available disk storage less than the average available disk storage minus the configured tolerance, and another RG has more than the average available disk storage.

Once RG Smoothing has been triggered, the RG Smooth Logic selects objects to reposition to the RG(s) with excess available disk storage.
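As an illustration of the trigger logic, here is a minimal sketch (Python, not the product implementation) of the two conditions described above, written over per-RG available-storage figures.

def smoothing_triggered(local_avail, other_avail, tolerance):
    """Decide whether RG smoothing should trigger for the local RG.

    local_avail: available disk storage of the local RG
    other_avail: available disk storage of each other RG in the cluster
    tolerance:   configured imbalance tolerance (same units as the storage values)
    """
    all_avail = [local_avail] + list(other_avail)
    avg = sum(all_avail) / len(all_avail)
    some_rg_above_avg = any(a > avg for a in other_avail)

    # Condition 1: local RG is below average by more than the tolerance,
    # and at least one RG has more than the average available storage.
    cond1 = local_avail < avg - tolerance and some_rg_above_avg

    # Condition 2: local RG is at or below average, no other RG is below
    # average by more than the tolerance, and another RG is above average.
    cond2 = (local_avail <= avg
             and not any(a < avg - tolerance for a in other_avail)
             and some_rg_above_avg)

    return cond1 or cond2

print(smoothing_triggered(40, [60, 80], 5))  # True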


If some RGs are brought within tolerance while other RGs are still out of tolerance, the smoothing logic favors smoothing from the RGs with the least available disk space to the RGs with the most available disk space. As a result, some RGs may stop smoothing for a period of time and resume later, and there may be periods during which a given RG is waiting for its RG smoother to be triggered to run.

RG Smoothing

The RGSmoother accomplishes the task of balancing by rewriting content using the same mechanisms employed by the Small Object Garbage Compaction (SOGC) evaluator.

The background repositioning of content is uniform, so the SOGC evaluator naturally performs RG Smoothing for "free".

When the RGSmoother determines that an object needs to be repositioned, the RG Smooth Operation first determines a target RG based upon the previously computed average available disk storage. The RGSmoother then copies the content from a goid in the over-full local RG into a new goid in the target under-full RG. The existing SOGC copy logic has been modified to accept and honor a destination RG, and has been modularized to allow consumption by both the SOGC evaluator and the RGSmooth evaluator. Once the copy is successfully made, the RG Smooth Operation updates the database with the new location of the content and deletes the old copy.

Configuring Resilience Group Smoothing

To configure Resilience Group Smoothing, click Enable under the Resilience Group Smoothing Service State pull-down.

Enabling/Disabling RG Smoothing

To enable or disable the RG Smoother evaluator, set allow resilience group smoother <1|0> using the following:

proc: echo <1|0> >/proc/calypso/tunables/allowresiliencegroupsmoother

The default value is 0 (disabled).

Adjusting the RG Smoother Evaluation Interval

To adjust how often the RG Smoother evaluator is triggered to check whether RG Smoothing is required, set resilience group smoother interval <value> using the following:

proc: echo <value> >/proc/calypso/tunables/resiliencegroupsmootherinterval

The value is specified in seconds and must be at least 60 seconds. The default value is 7200 seconds (2 hours).

Adjusting RG Smoother Max Agents

To adjust the maximum number of RG Smoother agents that may execute concurrently, set max resilience group smoother agents <value> using the following:

proc: echo <value> >/proc/calypso/tunables/maxresiliencegroupsmootheragents

The value must be at least 1, with the default being 50.


Adjusting the RG Smoother Imbalance Tolerance

If an RG is found to have available disk storage higher or lower than the average disk storage for all RGs by more than the tolerance, RG Smoothing will occur. To adjust the RG Smoother imbalance tolerance, set resilience group smoother imbalance tolerance <value> using the following:

proc: echo <value> >/proc/calypso/tunables/resiliencegroupsmootherimbalancetolerance

Values must be between 1 and 100, inclusive, with a default value of 5.

RG Logging

RG Smoother Evaluator progress is logged to the evaluator log. Entries include interval check, starting evaluation, objects smoothed, and so on. RG Smoother Evaluator status is logged to the protocoltiming and objectRepair logs. The protocol timing log contains a single line showing the status of the RG Smoother. The objectRepair log contains multiple lines expressing additional status.

ProtocolTiming log example:

+RGSmoother Active=101 m=1 c=50 a=50 w=0 comp=1% obj=0.79/s

Where:

• Plus "+" indicates that the RG Smoother evaluator is actively checking objects. When not active, an asterisk "*" is displayed.

• Active=101 indicates that 101 total work agents are active.

• m=1 indicates that more objects need to be checked.

• c=50 indicates that 50 check agents are active. Each check agent analyzes the next object in the list and either creates a compaction agent (or waits if the maximum number of compaction agents is already active) or loops and checks another object.

• a=50 indicates that 50 compaction agents are active. Each compaction agent is actively moving content.

• comp=1% indicates that 1% of the objects in the file system have been evaluated by the RG Smoothing evaluator.

• obj=0.79/s indicates that the RG Smoother evaluator is processing 0.79 objects per second.

objectRepair log example:

RGSmooth: 101 more=1 chk=50 agnt=50 checked=3092 skipped=0 tried=3092 waits=0 Interval=60 Timer=50 Ave=74 (53 - 85) Tol=5 Cap=53 Need=1 Async: start(22) ok(1) act(1) chk(3092) skp(0) Evaluator: enabled(1) processinglist(0) stopping(0)

Where:

• RGSmooth: 101 indicates that 101 total work agents are active.

• more=1 indicates that more objects need to be checked.

• chk=50 indicates that 50 check agents are active.

• agnt=50 indicates that 50 compaction agents are active.

• Interval=60 indicates the interval between RG Smoother evaluation attempts.


• Timer=50 indicates the number of seconds to the next evaluation attempt. If the RG Smoother evaluator is already active when a subsequent interval occurs, an additional smoothing pass will be queued to occur after the current evaluation completes.

• Ave=74 (53 - 85) indicates that the average available space across all RGs is 74%, the lowest available space on a RG is 53% and the highest available space on a RG is 85%.

• Tol=5 indicates the configured RG Smooth Imbalance Tolerance.

Troubleshooting

A new proc file /proc/calypso/status/resiliencegroupjsoninfo is available on every server in the cluster. It shows the server's view of the resiliency groups defined for the cluster. These groups are displayed in JSON format for ease of parsing.

Each server’s json info file should be similar – containing the same resilience groups and associated servers – but the order of the entries may be different on different servers.

Example JSON File

{
    "Resilience Groups":
    [
        {
            "Resilience Group ID": "2",
            "Status": "Normal",
            "Partner Resilience Group ID": "none",
            "Member Count": 3,
            "Members":
            [
                {
                    "Server ID": 14,
                    "State": "Normal",
                    "Affinity Hint": 0,
                    "Anti-affinity Hint": 2
                },
                {
                    "Server ID": 13,
                    "State": "Normal",
                    "Affinity Hint": 0,
                    "Anti-affinity Hint": 1
. . . .
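For quick checks or monitoring scripts, the file can be parsed directly. The following is a minimal sketch (Python, illustrative only, not part of COS) that reads the proc file and prints each resilience group and its member servers; it assumes the file contains complete JSON with the key names shown in the example above.

import json

# Read this server's view of the cluster's resilience groups (path from this section).
with open("/proc/calypso/status/resiliencegroupjsoninfo") as f:
    info = json.load(f)

for group in info.get("Resilience Groups", []):
    print("RG", group["Resilience Group ID"], "status:", group["Status"])
    for member in group.get("Members", []):
        print("  server", member["Server ID"], "state:", member["State"])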


Resilience Status

Resilience status is provided for individual nodes, for each Resilience Group, and for the cluster (pod). This status provides the following:

• Resilience Evaluation – indicates when the last resilience evaluation was completed or whether it is still in progress. These evaluations are performed independently, on each node, and the information can be viewed at the Resilience Group and Cluster levels.

• Goid Counting – provides a count of the total number of Goids and Goids that could not be repaired.

• LEC and DEC Recovery Rates – provides data recovery rates for LEC and DEC recovery. The LEC information is of value at the node level, while the DEC information is of value at both the node and the Resilience Group level. As part of this information gathering, each node needs to track a count of evaluated Goids and repaired Goids since the last evaluation cycle started (counts for the last minute are also tracked to measure the current rate).

• Resilience Evaluation Progress – tracks the progress of the resilience evaluation, tracking when the evaluation started, what percentage has been evaluated, and what percentage is waiting to be evaluated. This information, combined with the current evaluation rate, allows you to perform a rough estimate as to when the evaluation will be completed.

• Drive and Disk Space Availability – indicates whether or not sufficient drives and disk space are available on a node for LEC. Note that when running with LEC N:M, there must be at least M disk drives worth of free space.

• Resilience Group Servers and Disk Space – indicates whether or not sufficient servers and space are available in a resilience group for DEC. Note that when running with DEC N:M, there must be at least M servers worth of free space.

• Information Availability Medium – Resilience status information is available via proc and log files. Proc files can be used to query the current status of the node/Resilience Group/Cluster. Log files can be used to view the past status of the node and resilience groups.

/proc/calypso/status/cosnodestatus.json File

"Server Resilience Status":
{
    "Resilience State": number,
    "Data Recovery Scope": number,
    "Disks Lost Since Last Data Recovery": number,
    "Data Recovery":
    {
        "Last Start Time": Seconds relative 1970,
        "Last Completion Time": Seconds relative 1970,
        "Status": "Active" / "Not Active",
        "Evaluated Goids": number,
        "Current Evaluation Rate": number / minute,
        "Target Completion Time": Seconds relative 1970
    },
    "DEC / Mirror Recovery":
    {
        "Last Start Time": Seconds relative 1970,
        "Last Completion Time": Seconds relative 1970,
        "Status": "Active" / "Not Active",
        "Evaluated Goids": number,
        "Current evaluation rate": number per minute,
        "Target Completion Time": Seconds relative 1970
    },
    "Total Goids": number,
    "Could Not Repair Goids": number
}

"RG Resilience Status":
{
    "Resilience State": number,
    "Data Recovery Scope": number,
    "Servers In Data Recovery": number,
    "Servers Lost Since Last DEC Recovery": number,
    "DEC / Mirror Recovery":
    {
        "Last Start Time": Seconds relative 1970,
        "Last Completion Time": Seconds relative 1970,
        "Status": "Active" / "Not Active",
        "Evaluated Goids": number,
        "Current evaluation rate": number per minute,
        "Target Completion Time": Seconds relative 1970
    },
    "Total Goids": number,
    "Could Not Repair Goids": number
}

Resilience Status Log Files

Note This log file is updated with a new set of statistics every minute. The log file displays statistics for the local server, all resilience groups, and for the cluster.

Node Resilience State

The following table provides a description of the states that are used to represent the health of the data, scoped to the node level.

Node Resilience State Metric

Value | Cause / Meaning | Severity | Recommended Administrative Action
0 | LEC is not enabled or all drives are present. Data can be written in the configured resilient manner and the loss of a disk will not put the node into a compromised state. | ok | N/A
1 | The node has lost a drive, but data can be written in the configured resilient manner. Given x is the number of drives present, (LEC[n] + LEC[m] + 1 <= x). | warning | Replace lost drives when able.
2 | The node has less than LEC[n] + LEC[m] drives present but it does have at least LEC[n] + 1 drives present. Data cannot be written on the node in the configured resilient manner but can be written with minimal resilience. | degraded | Replace lost drives as soon as possible.
3 | Given x is the number of drives present, LEC[n] >= x. Data cannot be written on the node in a resilient manner. | compromised | Replace lost drives immediately.


RG Resilience State

The following table provides a description of the states that are used to represent the health of the data scoped to the RG level.

RG Resilience State Metric

Value | Cause / Meaning | Severity | Recommended Administrative Action
0 | DEC is not enabled or all nodes are present. The loss of a node will not put the RG into a compromised state. | ok | N/A
1 | At least one node has been lost, but data can be written in the configured resilient manner. Given x is the number of nodes online in an RG, (DEC[n] + DEC[m] <= x). There are nodes offline, but data can still be written in a resilient manner. | warning | Recover lost nodes when able.
2 | The number of nodes online in an RG is x, such that (DEC[n] + 1 <= x) and (x < DEC[n] + DEC[m]). Data cannot be written in the RG at the configured resilience level but it can be written with minimal resilience. | degraded | Recover lost nodes as soon as possible.
3 | The number of nodes online in an RG is x, such that (x <= DEC[n]). Data cannot be written in the RG in a resilient manner. | compromised | Recover lost nodes immediately.

Data Recovery Scope

The following table provides the various states representing the scope of data recovery and provides insight into the breadth of impact of component loss relative to the active data recovery.

Data Recovery Scope Metric

Value | Scope | Cause / Meaning | Severity
0 | N/A | No active repairs anywhere in the cluster. | N/A
1 | rg or node | All components (disk drives, nodes, and RGs) are present in the cluster. The active repairs are completing unfinished parity generation work or restriping content. | ok
2 | node | Since the last successful completion of a repair cycle, LEC[m] or fewer drives per node have failed. | warning
3 | rg | Since the last successful completion of a repair cycle, more than LEC[m] drives have failed on any given node. | warning
4 | rg | Since the last successful completion of a repair cycle, DEC[m] or fewer nodes per RG have failed. | warning
5 | rg | Since the last successful completion of a repair cycle, more than DEC[m] nodes have failed in any RG. | compromised


Miscellaneous Resilience Status

The following list provides a brief description of the information contained in the Resilience Status log file.

• DEC N:M configuration and status

• Data Recovery's last start time / last completion time / estimated completion time

• Mirror Recovery's last start time / last completion time / estimated completion time

• Evaluated Goid rate

• Total Goid count and the count of Goids that could not be repaired

• Active server count for RG

• Total Disk Space

• Available Disk Space

• Disks lost since last successful Data Recovery evaluation

• Servers lost since last successful DEC / Mirror Recovery evaluation

