MinCopysets: Derandomizing Replication in Cloud Storage
MinCopysets: Derandomizing Replication in Cloud Storage
Stanford University
Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum
Unpublished – Please do not distribute
Overview
Assumptions: no geo-replication; in practice, Azure uses much smaller clusters
• Primary data stored on masters (memory)
• Divide each master’s data into chunks
• Chunks are replicated on backups (disk)
– When a master crashes, recover from thousands of backups
RAMCloud
[Figure: masters (memory) and backups (disk); a crashed master is recovered from its backups]
Random Replication
[Figure: chunks 1-3 spread across nodes 1-10; each chunk’s primary and two secondary replicas are placed on randomly chosen nodes]
The Problem
• Randomized replication loses data in power outages
– 0.5-1% of the nodes fail to reboot
– 1-2 times a year
– Result: a handful of chunks (GBs of data) are unavailable (LinkedIn ’12)
• Sub-problem: managed power downs
– Software upgrades
– Reduced power consumption
Intuition
• If we have one chunk, we are safe:
– Replicate the chunk on three nodes
– Data is lost only if the failed nodes contain all three copies of the chunk
– 1% of the nodes fail: 0.0001% probability of data loss
• If we have millions of chunks, we lose data:
– A 1000-node HDFS cluster has 10 million chunks
– 1% of the nodes fail: 99.93% probability of data loss
Mathematical Intuition
• A copyset of nodes is a single unit of failure
– Each chunk is replicated on a single copyset
• For one chunk, the probability of data loss is C(F, R) / C(N, R), where:
– F = number of failed nodes
– R = replication factor
– N = number of nodes
• For all chunks, the probability is 1 − (1 − C(F, R) / C(N, R))^B, where:
– B = number of chunks
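These two expressions can be sanity-checked against the numbers on the Intuition slide. A minimal Python sketch (treating chunks as independent copysets, which is an approximation):

```python
from math import comb

def p_chunk_lost(N, R, F):
    """Probability that a single chunk's R replicas all land on failed
    nodes, when the chunk's copyset is a uniform random R-subset."""
    return comb(F, R) / comb(N, R)

def p_any_loss(N, R, F, B):
    """Probability that at least one of B chunks is lost; treats chunks
    as independent, which is an approximation."""
    return 1 - (1 - p_chunk_lost(N, R, F)) ** B

# Numbers from the Intuition slide: 1000 nodes, R = 3, 1% of nodes fail.
N, R, F, B = 1000, 3, 10, 10**7
print(f"one chunk:  {p_chunk_lost(N, R, F):.6%}")  # 0.000072%, which the slide rounds to 0.0001%
print(f"10M chunks: {p_any_loss(N, R, F, B):.2%}") # ~99.93%
```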
Changing R Doesn’t Help
Changing the Chunk Size Doesn’t Help
MinCopysets: Decouple Load Balancing and Durability
• Split nodes into fixed replication groups
• Random Distribution: place the primary replica on a random node
• Deterministic Replication: place the secondary replicas deterministically on the same replication group as the primary (see the sketch below)
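A minimal sketch of this placement rule, assuming a simple static partition into groups (the function names are illustrative, not RAMCloud’s or HDFS’s API):

```python
import random

def make_groups(nodes, r=3):
    """Statically partition the cluster into fixed replication groups of
    size r (leftover nodes are simply dropped in this sketch)."""
    nodes = list(nodes)
    return [tuple(nodes[i:i + r]) for i in range(0, len(nodes) - r + 1, r)]

def place_chunk(groups):
    """Pick the primary's node at random; the secondaries go
    deterministically on the remaining members of its group."""
    group = random.choice(groups)
    primary = random.choice(group)
    secondaries = [n for n in group if n != primary]
    return primary, secondaries

groups = make_groups(range(9))
print(place_chunk(groups))  # e.g. (4, [3, 5]): all replicas stay in one group
```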
MinCopysets Architecture
[Figure: three replication groups of three nodes each. Chunks 1 and 3 are replicated entirely within one group (Nodes 7, 55, 24), chunk 2 within another (Nodes 2, 83, 8), and chunk 4 within a third (Nodes 1, 22, 47); each chunk’s primary sits on one group member and its secondaries on the other two]
Extreme Failure Scenarios
• Even in the extreme scenario where 3-4% of the cluster’s nodes fail to reboot, MinCopysets provides low data loss probabilities
• For example:
– 4000-node HDFS cluster
– 120 nodes fail to reboot after a power outage
– Only 3.5% probability of data loss (see the sketch below)
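Under MinCopysets, data is lost only when an entire replication group is among the failed nodes. A sketch of that calculation, again treating groups as independent (an approximation), reproduces the slide’s number:

```python
from math import comb

def p_group_dead(N, R, F):
    """Probability that all R nodes of one fixed replication group are
    among the F failed nodes (failures uniform over the cluster)."""
    return comb(N - R, F - R) / comb(N, F)

def p_loss_mincopysets(N, R, F):
    """Probability that at least one of the N/R groups fails entirely;
    treats groups as independent, which is an approximation."""
    return 1 - (1 - p_group_dead(N, R, F)) ** (N // R)

# Numbers from the slide: 4000 nodes, R = 3, 120 failed nodes.
print(f"{p_loss_mincopysets(4000, 3, 120):.1%}")  # ~3.5%
```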
Extreme Failure Scenarios: Normal Clusters
Extreme Failure Scenarios: Big Clusters
MinCopysets’ Trade-off
• Trades off frequency and magnitude of failures
– Expected data loss is the same
– Data loss occurs very rarely
– When it does occur, the magnitude of data loss is greater
Frequency vs. Magnitude of Failures
• Setup:
– 5000-node HDFS cluster
– 3 TB per machine
– R = 3
– Power outage once a year
• Random replication:
– Loses 5.5 GB every single year
• MinCopysets:
– Loses data once every 625 years
– Loses an entire node’s data in case of failure (see the sketch below)
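The same group-failure model as in the earlier sketch roughly reproduces these numbers, assuming 1% of nodes fail per outage (the slides give a 0.5-1% range):

```python
from math import comb

def p_loss_mincopysets(N, R, F):
    """Same model as the previous sketch: data is lost iff some entire
    replication group is among the failed nodes."""
    p_group = comb(N - R, F - R) / comb(N, F)
    return 1 - (1 - p_group) ** (N // R)

N, R, F = 5000, 3, 50  # the slide's setup, assuming 1% of nodes fail per outage
p = p_loss_mincopysets(N, R, F)
print(f"loss probability per outage: {p:.2%}")   # ~0.16%
print(f"mean outages between losses: {1/p:.0f}") # ~640, near the slide's 625
# Magnitude: a dead group loses a full node (~3 TB here), so the expected
# annual loss (~3 TB / ~640) is a few GB, on the order of random
# replication's ~5.5 GB/year; only the frequency/magnitude mix changes.
```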
RAMCloud Implementation
• The RAMCloud implementation was relatively straightforward
• Two non-trivial issues:
1. Need to manage groups of nodes
• Allocate chunks on entire groups
• Manage nodes joining and leaving groups
2. Machine failures are more complex
• Need to re-replicate the entire group, rather than individual nodes (see the sketch below)
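A toy sketch of the second issue, using made-up data structures rather than RAMCloud’s actual ones: losing a single node retires its whole group, and every chunk the group held must move to a freshly formed group.

```python
import random

def handle_node_failure(failed_node, groups, chunk_map, spares, r=3):
    """Retire the whole group containing the failed node and re-replicate
    all of its chunks onto a new group; under MinCopysets a single node
    cannot simply be repaired in place."""
    dead = next(g for g in groups if failed_node in g)
    groups.remove(dead)
    new_group = tuple(random.sample(spares, r))  # stand-in for real allocation
    for node in new_group:
        spares.remove(node)
    chunk_map[new_group] = chunk_map.pop(dead)   # move every chunk the group held
    groups.append(new_group)

groups = [(1, 2, 3), (4, 5, 6)]
chunk_map = {(1, 2, 3): ["chunk-A", "chunk-B"], (4, 5, 6): ["chunk-C"]}
spares = [7, 8, 9, 10]
handle_node_failure(2, groups, chunk_map, spares)
print(groups)     # the (1, 2, 3) group is gone, replaced by three spares
print(chunk_map)  # chunks A and B now live on the new group
```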
RAMCloud Implementation
[Figure: the RAMCloud coordinator sends an "Assign Replication Group" RPC to each backup and records the assignment in the coordinator server list; when a master sends an "Open New Chunk" RPC, the reply carries the backup’s replication group]

Coordinator server list:

Server ID | Replication Group ID
Server 0  | 5
Server 1  | 0
Server 2  | 5
Server 3  | 7
…         | …
HDFS Implementation
• Even simpler than RAMCloud
• In HDFS, replication decisions are centralized on the NameNode, whereas in RAMCloud they are distributed
– The NameNode assigns DataNodes to replication groups
• Prototyped in 200 LoC
HDFS Issues
• Has the same issues as RAMCloud in managing groups of nodes
• Issue: repair bandwidth
– Solution: hybrid scheme (described on the last slide)
• Issue: network bottlenecks and load balancing
– Solution: kill the replication group and re-replicate its data elsewhere
• Issue: a replication group’s capacity is limited by its node with the smallest capacity
– Solution: choose replication groups with similar capacities
Facebook’s HDFS Replication
• Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss
• Facebook’s algorithm:
– The primary replica is placed on node j and rack k
– The secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2) (see the sketch below)
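A rough sketch of this windowed policy as described on the slide; the node-to-rack mapping is simplified here and the parameters are illustrative:

```python
import random

def facebook_place(j, k, num_nodes, num_racks):
    """Primary on node j in rack k; the two secondaries are random nodes
    from the window (j+1, ..., j+5), placed on racks k+1 and k+2."""
    window = [(j + i) % num_nodes for i in range(1, 6)]
    secondaries = random.sample(window, 2)
    racks = [(k + 1) % num_racks, (k + 2) % num_racks]
    return (j, k), list(zip(secondaries, racks))

print(facebook_place(j=17, k=4, num_nodes=1000, num_racks=30))
```

Because each primary’s secondaries are drawn from a small fixed window, a node participates in far fewer copysets than under fully random placement, which is what reduces the loss probability.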
Facebook’s Replication
Hybrid MinCopysets
• Split nodes into replication groups of 2 and 15
• The first and second replicas are always placed on the group of 2
• The third replica is randomly placed on the group of 15 (see the sketch below)
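A minimal sketch of the hybrid placement rule (how a group of 2 is associated with a group of 15 is not specified on the slide; here it is simply passed in):

```python
import random

def hybrid_place(pair_group, big_group):
    """Hybrid MinCopysets: replicas 1 and 2 always occupy the fixed group
    of 2; replica 3 is a random member of the group of 15, which restores
    some placement flexibility (e.g. for repair bandwidth)."""
    first, second = pair_group
    third = random.choice(big_group)
    return [first, second, third]

pair = (0, 1)               # fixed replication group of 2
big = tuple(range(10, 25))  # replication group of 15
print(hybrid_place(pair, big))
```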
Thank You!
Stanford University