MinCopysets: Derandomizing Replication in Cloud Storage
MinCopysets: Derandomizing Replication in Cloud Storage
Stanford University
Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum
Unpublished – Please do not distribute
Overview
Assumptions: no geo-replication; in practice, Azure uses much smaller clusters
• Primary data stored on masters (memory)
• Divide each master’s data into chunks
• Chunks are replicated on backups (disk)
– When a master crashes, recover from thousands of backups
RAMCloud
[Figure: masters (memory) and backups (disk); a crashed master is recovered from its backups]
Random Replication
[Figure: chunks 1-3 spread across nodes 1-10; each chunk’s primary and two secondary replicas are placed on randomly chosen nodes]
The Problem
• Randomized replication loses data in power outages
– 0.5-1% of the nodes fail to reboot
– 1-2 times a year
– Result: a handful of chunks (GBs of data) are unavailable (LinkedIn ’12)
• Sub-problem: managed power downs
– Software upgrades
– Reduced power consumption
Intuition
• If we have one chunk, we are safe:
– Replicate the chunk on three nodes
– Data is lost only if the failed nodes contain all three copies of the chunk
– 1% of the nodes fail: 0.0001% probability of data loss
• If we have millions of chunks, we lose data:
– A 1000-node HDFS cluster has 10 million chunks
– 1% of the nodes fail: 99.93% probability of data loss
Mathematical Intuition
• A copyset of nodes is a single unit of failure
– Each chunk is replicated on a single copyset
• For one chunk, the probability of data loss is C(F, R) / C(N, R), where:
– F = number of failed nodes
– R = replication factor
– N = number of nodes
• For all chunks, the probability is 1 − (1 − C(F, R) / C(N, R))^B, where:
– B = number of chunks
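These two expressions can be sanity-checked against the numbers on the Intuition slide. A minimal Python sketch (treating chunks as independent copysets, which is an approximation):

```python
from math import comb

def p_chunk_lost(N, R, F):
    """Probability that a single chunk's R replicas all land on failed
    nodes, when the chunk's copyset is a uniform random R-subset."""
    return comb(F, R) / comb(N, R)

def p_any_loss(N, R, F, B):
    """Probability that at least one of B chunks is lost; treats chunks
    as independent, which is an approximation."""
    return 1 - (1 - p_chunk_lost(N, R, F)) ** B

# Numbers from the Intuition slide: 1000 nodes, R = 3, 1% of nodes fail.
N, R, F, B = 1000, 3, 10, 10**7
print(f"one chunk:  {p_chunk_lost(N, R, F):.6%}")  # 0.000072%, which the slide rounds to 0.0001%
print(f"10M chunks: {p_any_loss(N, R, F, B):.2%}") # ~99.93%
```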
Changing R Doesn’t Help
Changing the Chunk Size Doesn’t Help
MinCopysets: Decouple Load Balancing and Durability
• Split nodes into fixed replication groups
• Random Distribution: place the primary replica on a random node
• Deterministic Replication: place the secondary replicas deterministically on the same replication group as the primary (see the sketch below)
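A minimal sketch of this placement rule, assuming a simple static partition into groups (the function names are illustrative, not RAMCloud’s or HDFS’s API):

```python
import random

def make_groups(nodes, r=3):
    """Statically partition the cluster into fixed replication groups of
    size r (leftover nodes are simply dropped in this sketch)."""
    nodes = list(nodes)
    return [tuple(nodes[i:i + r]) for i in range(0, len(nodes) - r + 1, r)]

def place_chunk(groups):
    """Pick the primary's node at random; the secondaries go
    deterministically on the remaining members of its group."""
    group = random.choice(groups)
    primary = random.choice(group)
    secondaries = [n for n in group if n != primary]
    return primary, secondaries

groups = make_groups(range(9))
print(place_chunk(groups))  # e.g. (4, [3, 5]): all replicas stay in one group
```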
MinCopysets Architecture
[Figure: three replication groups of three nodes each. Chunks 1 and 3 are replicated entirely within one group (Nodes 7, 55, 24), chunk 2 within another (Nodes 2, 83, 8), and chunk 4 within a third (Nodes 1, 22, 47); each chunk’s primary sits on one group member and its secondaries on the other two]
Extreme Failure Scenarios
• Even in the extreme scenario where 3-4% of the cluster’s nodes fail to reboot, MinCopysets provides low data loss probabilities
• For example:
– 4000-node HDFS cluster
– 120 nodes fail to reboot after a power outage
– Only 3.5% probability of data loss (see the sketch below)
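Under MinCopysets, data is lost only when an entire replication group is among the failed nodes. A sketch of that calculation, again treating groups as independent (an approximation), reproduces the slide’s number:

```python
from math import comb

def p_group_dead(N, R, F):
    """Probability that all R nodes of one fixed replication group are
    among the F failed nodes (failures uniform over the cluster)."""
    return comb(N - R, F - R) / comb(N, F)

def p_loss_mincopysets(N, R, F):
    """Probability that at least one of the N/R groups fails entirely;
    treats groups as independent, which is an approximation."""
    return 1 - (1 - p_group_dead(N, R, F)) ** (N // R)

# Numbers from the slide: 4000 nodes, R = 3, 120 failed nodes.
print(f"{p_loss_mincopysets(4000, 3, 120):.1%}")  # ~3.5%
```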
Extreme Failure Scenarios: Normal Clusters
Extreme Failure Scenarios: Big Clusters
MinCopysets’ Trade-off
• Trades off frequency and magnitude of failures
– Expected data loss is the same
– Data loss occurs very rarely
– When it does occur, the magnitude of data loss is greater
Frequency vs. Magnitude of Failures
• Setup:
– 5000-node HDFS cluster
– 3 TB per machine
– R = 3
– Power outage once a year
• Random replication:
– Loses 5.5 GB every single year
• MinCopysets:
– Loses data once every 625 years
– Loses an entire node’s data in case of failure (see the sketch below)
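The same group-failure model as in the earlier sketch roughly reproduces these numbers, assuming 1% of nodes fail per outage (the slides give a 0.5-1% range):

```python
from math import comb

def p_loss_mincopysets(N, R, F):
    """Same model as the previous sketch: data is lost iff some entire
    replication group is among the failed nodes."""
    p_group = comb(N - R, F - R) / comb(N, F)
    return 1 - (1 - p_group) ** (N // R)

N, R, F = 5000, 3, 50  # the slide's setup, assuming 1% of nodes fail per outage
p = p_loss_mincopysets(N, R, F)
print(f"loss probability per outage: {p:.2%}")   # ~0.16%
print(f"mean outages between losses: {1/p:.0f}") # ~640, near the slide's 625
# Magnitude: a dead group loses a full node (~3 TB here), so the expected
# annual loss (~3 TB / ~640) is a few GB, on the order of random
# replication's ~5.5 GB/year; only the frequency/magnitude mix changes.
```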
RAMCloud Implementation
• The RAMCloud implementation was relatively straightforward
• Two non-trivial issues:
1. Need to manage groups of nodes
• Allocate chunks on entire groups
• Manage nodes joining and leaving groups
2. Machine failures are more complex
• Need to re-replicate the entire group, rather than individual nodes (see the sketch below)
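A toy sketch of the second issue, using made-up data structures rather than RAMCloud’s actual ones: losing a single node retires its whole group, and every chunk the group held must move to a freshly formed group.

```python
import random

def handle_node_failure(failed_node, groups, chunk_map, spares, r=3):
    """Retire the whole group containing the failed node and re-replicate
    all of its chunks onto a new group; under MinCopysets a single node
    cannot simply be repaired in place."""
    dead = next(g for g in groups if failed_node in g)
    groups.remove(dead)
    new_group = tuple(random.sample(spares, r))  # stand-in for real allocation
    for node in new_group:
        spares.remove(node)
    chunk_map[new_group] = chunk_map.pop(dead)   # move every chunk the group held
    groups.append(new_group)

groups = [(1, 2, 3), (4, 5, 6)]
chunk_map = {(1, 2, 3): ["chunk-A", "chunk-B"], (4, 5, 6): ["chunk-C"]}
spares = [7, 8, 9, 10]
handle_node_failure(2, groups, chunk_map, spares)
print(groups)     # the (1, 2, 3) group is gone, replaced by three spares
print(chunk_map)  # chunks A and B now live on the new group
```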
RAMCloud Implementation
[Figure: the RAMCloud coordinator sends an "Assign Replication Group" RPC to each backup and records the assignment in the coordinator server list; when a master sends an "Open New Chunk" RPC, the reply carries the backup’s replication group]

Coordinator server list:

Server ID | Replication Group ID
Server 0  | 5
Server 1  | 0
Server 2  | 5
Server 3  | 7
…         | …
HDFS Implementation
• Even simpler than RAMCloud
• In HDFS, replication decisions are centralized on the NameNode, whereas in RAMCloud they are distributed
– The NameNode assigns DataNodes to replication groups
• Prototyped in 200 LoC
HDFS Issues
• Has the same issues as RAMCloud in managing groups of nodes
• Issue: repair bandwidth
– Solution: hybrid scheme (described on the last slide)
• Issue: network bottlenecks and load balancing
– Solution: kill the replication group and re-replicate its data elsewhere
• Issue: a replication group’s capacity is limited by its node with the smallest capacity
– Solution: choose replication groups with similar capacities
Facebook’s HDFS Replication
• Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss
• Facebook’s algorithm:
– The primary replica is placed on node j and rack k
– The secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2) (see the sketch below)
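A rough sketch of this windowed policy as described on the slide; the node-to-rack mapping is simplified here and the parameters are illustrative:

```python
import random

def facebook_place(j, k, num_nodes, num_racks):
    """Primary on node j in rack k; the two secondaries are random nodes
    from the window (j+1, ..., j+5), placed on racks k+1 and k+2."""
    window = [(j + i) % num_nodes for i in range(1, 6)]
    secondaries = random.sample(window, 2)
    racks = [(k + 1) % num_racks, (k + 2) % num_racks]
    return (j, k), list(zip(secondaries, racks))

print(facebook_place(j=17, k=4, num_nodes=1000, num_racks=30))
```

Because each primary’s secondaries are drawn from a small fixed window, a node participates in far fewer copysets than under fully random placement, which is what reduces the loss probability.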
Facebook’s Replication
Hybrid MinCopysets
• Split nodes into replication groups of 2 and 15
• The first and second replicas are always placed on the group of 2
• The third replica is randomly placed on the group of 15 (see the sketch below)
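A minimal sketch of the hybrid placement rule (how a group of 2 is associated with a group of 15 is not specified on the slide; here it is simply passed in):

```python
import random

def hybrid_place(pair_group, big_group):
    """Hybrid MinCopysets: replicas 1 and 2 always occupy the fixed group
    of 2; replica 3 is a random member of the group of 15, which restores
    some placement flexibility (e.g. for repair bandwidth)."""
    first, second = pair_group
    third = random.choice(big_group)
    return [first, second, third]

pair = (0, 1)               # fixed replication group of 2
big = tuple(range(10, 25))  # replication group of 15
print(hybrid_place(pair, big))
```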
Thank You!
Stanford University