MinCopysets: Derandomizing Replication in Cloud Storage
Stanford University
Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum
Unpublished – Please do not distribute
Overview
Assumptions: no geo-replication; Azure uses much smaller clusters in practice
RAMCloud
• Primary data stored on master (memory)
• Divide each master’s data into chunks
• Chunks are replicated on backups (disk)
– When master crashes, recover from thousands of backups
[Figure: Masters and Backups, with one Crashed Master]
Random Replication
[Figure: ten nodes (Node 1 to Node 10) and three chunks; each chunk has one primary and two secondary replicas spread across the nodes]
The Problem
• Randomized replication loses data in power outages
– 0.5-1% of the nodes fail to reboot
– 1-2 times a year
– Result: handful of chunks (GBs of data) are unavailable (LinkedIn ‘12)
• Sub-problem: managed power downs
– Software upgrades
– Reduced power consumption
Intuition
• If we have one chunk, we are safe:
– Replicate chunk on three nodes
– Data is lost if the failed nodes hold all three copies of a chunk
– 1% of the nodes fail: 0.0001% probability of data loss
• If we have millions of chunks, we lose data:
– 1000 node HDFS cluster has 10 million chunks
– 1% of the nodes fail: 99.93% probability of data loss
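The two figures above can be checked with a few lines of arithmetic. The sketch below is an assumption about how they were derived, not the authors' code: it assumes each chunk's replicas land on a uniformly random set of R distinct nodes, that chunks are placed independently, and that 1% of a 1000-node cluster means 10 failed nodes. The function names are hypothetical.

```python
from math import comb

def p_chunk_lost(n_nodes: int, n_failed: int, r: int) -> float:
    """Probability that all r replicas of one chunk sit on failed nodes."""
    return comb(n_failed, r) / comb(n_nodes, r)

def p_any_chunk_lost(n_nodes: int, n_failed: int, r: int, n_chunks: int) -> float:
    """Probability that at least one of n_chunks independently placed chunks is lost."""
    return 1.0 - (1.0 - p_chunk_lost(n_nodes, n_failed, r)) ** n_chunks

# 1000-node cluster, 1% of nodes failed, three replicas, 10 million chunks
N, F, R, B = 1000, 10, 3, 10_000_000
one_chunk = p_chunk_lost(N, F, R)
all_chunks = p_any_chunk_lost(N, F, R, B)
print(f"one chunk:  {one_chunk:.2e}")
print(f"all chunks: {all_chunks:.4f}")
```

Under these assumptions the single-chunk probability comes out slightly below 10⁻⁶ (0.01³ = 10⁻⁶, i.e. 0.0001%, is the simpler approximation), and the all-chunks probability rounds to 99.93%.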
Mathematical Intuition
• A copyset of nodes is a single unit of failure
– Each chunk is replicated on a single copyset
• For one chunk, the probability of data loss is:
– F = number of failed nodes
– R = replication factor
– N = number of nodes
• For all chunks, the probability is:
– B = number of chunks
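The two probabilities referred to above can be written as follows. This is a reconstruction from the variable definitions, assuming each chunk's R replicas land on a uniformly random set of R nodes and that chunks are placed independently; the notation is not taken from the original slide, although the expression reproduces the 99.93% quoted on the Intuition slide for N = 1000, F = 10, R = 3, B = 10^7.

```latex
P_{\text{chunk}} = \frac{\binom{F}{R}}{\binom{N}{R}},
\qquad
P_{\text{loss}} = 1 - \left(1 - P_{\text{chunk}}\right)^{B}
\approx 1 - e^{-B \, P_{\text{chunk}}}
```

The exponent B is what drives the result: with random replication every chunk is an independent chance to hit a fully failed copyset.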
Changing R Doesn’t Help
Changing the Chunk Size Doesn’t Help
MinCopysets: Decouple Load Balancing and Durability
• Split nodes into fixed replication groups
• Random Distribution: Place primary replica on random node
• Deterministic Replication: Place secondary replicas deterministically on same replication group as primary
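The three steps above can be sketched in a few lines. This is a minimal illustration, not the RAMCloud or HDFS implementation: the function names are hypothetical, groups are formed by simply slicing the node list, and leftover nodes that do not fill a group are left unused. A random-replication baseline is included so the number of distinct copysets can be compared.

```python
import random
from typing import Dict, List, Tuple

def make_groups(nodes: List[int], r: int) -> List[Tuple[int, ...]]:
    """Split nodes into fixed replication groups of size r (leftovers unused)."""
    usable = len(nodes) - len(nodes) % r
    return [tuple(nodes[i:i + r]) for i in range(0, usable, r)]

def place_mincopysets(grouped_nodes: List[int],
                      group_of: Dict[int, Tuple[int, ...]],
                      rng: random.Random) -> Tuple[int, List[int]]:
    """Random primary; secondaries are the other members of the primary's group."""
    primary = rng.choice(grouped_nodes)
    secondaries = [n for n in group_of[primary] if n != primary]
    return primary, secondaries

def place_random(nodes: List[int], r: int,
                 rng: random.Random) -> Tuple[int, List[int]]:
    """Baseline: all r replicas on distinct, uniformly random nodes."""
    primary, *secondaries = rng.sample(nodes, r)
    return primary, secondaries

rng = random.Random(0)
R = 3
nodes = list(range(30))
groups = make_groups(nodes, R)
group_of = {n: g for g in groups for n in g}
grouped_nodes = sorted(group_of)

min_sets, rand_sets = set(), set()
for _ in range(1000):                      # place 1000 chunks with each scheme
    p, s = place_mincopysets(grouped_nodes, group_of, rng)
    min_sets.add(frozenset([p, *s]))
    p, s = place_random(nodes, R, rng)
    rand_sets.add(frozenset([p, *s]))

print(len(min_sets), "distinct copysets with MinCopysets")
print(len(rand_sets), "distinct copysets with random replication")
```

Load balancing still comes from the random choice of primary, while the number of copysets can never exceed the number of groups, however many chunks are written.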
MinCopysets Architecture
[Figure: nine nodes in three replication groups of three, storing Chunks 1 to 4. Nodes 55, 7 and 24 hold all replicas of Chunk 1 (primary on Node 7) and Chunk 3 (primary on Node 55); Nodes 2, 83 and 8 hold all replicas of Chunk 2; Nodes 1, 22 and 47 hold all replicas of Chunk 4]
Extreme Failure Scenarios
• In the extreme scenario where 3-4% of the cluster’s nodes fail to reboot, MinCopysets still provides a low probability of data loss
• For example:
– 4000 node HDFS cluster
– 120 nodes fail to reboot after power outage
– Only 3.5% probability of data loss
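The 3.5% figure can be approximated with the same counting argument used for random replication, except that only N/R copysets exist. The sketch below assumes R = 3 (not stated on this slide), assumes the failed nodes are a uniformly random subset, and treats the groups as independent, which is an approximation. The function name is hypothetical.

```python
from math import comb

def p_loss_mincopysets(n_nodes: int, n_failed: int, r: int) -> float:
    """Approximate probability that at least one replication group fails entirely."""
    n_groups = n_nodes // r                        # MinCopysets has one copyset per group
    p_group = comb(n_failed, r) / comb(n_nodes, r) # a given group is fully down
    return 1.0 - (1.0 - p_group) ** n_groups       # independence is an approximation

p = p_loss_mincopysets(n_nodes=4000, n_failed=120, r=3)
print(f"probability of data loss: {p:.1%}")
```

With these inputs the result lands at roughly 3.5%, in line with the slide.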
Extreme Failure Scenarios: Normal Clusters
Extreme Failure Scenarios: Big Clusters
MinCopysets’ Trade Off
• Trades off frequency and magnitude of failures
– Expected data loss is the same
– Data loss occurs very rarely
– The magnitude of data loss is greater
Frequency vs. Magnitude of Failures
• Setup:
– 5000 node HDFS cluster
– 3 TB per machine
– R = 3
– Power outage once a year
• Random replication
– Lose 5.5 GB every single year
• MinCopysets
– Lose data once every 625 years
– Lose an entire node’s worth of data when a loss occurs
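The trade-off can be made concrete with a rough calculation. The slide does not state what fraction of nodes fails in each outage; the sketch below assumes 1% (50 nodes), so its outputs only match the slide's 5.5 GB and 625 years in order of magnitude. It also assumes every node is full and uses decimal terabytes. What it does show exactly is that the expected yearly loss is identical under both schemes.

```python
from math import comb

N, R = 5000, 3
TB = 10**12
PER_NODE = 3 * TB
FAILED = 50               # ASSUMPTION: 1% of nodes fail to reboot (not on the slide)
OUTAGES_PER_YEAR = 1

# Probability that one particular set of R nodes is entirely among the failed nodes
p_copyset = comb(FAILED, R) / comb(N, R)

# Random replication: every chunk is exposed, so a small amount is lost every year
unique_data = N * PER_NODE / R
random_expected = unique_data * p_copyset * OUTAGES_PER_YEAR

# MinCopysets: only N/R copysets exist, each holding one node's worth of unique data
n_groups = N / R
loss_events_per_year = n_groups * p_copyset * OUTAGES_PER_YEAR
min_expected = loss_events_per_year * PER_NODE
years_between_losses = 1 / loss_events_per_year

print(f"random replication: {random_expected / 1e9:.1f} GB expected per year")
print(f"MinCopysets:        {min_expected / 1e9:.1f} GB expected per year, "
      f"one loss event every {years_between_losses:.0f} years")
```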
RAMCloud Implementation
• RAMCloud implementation was relatively straightforward
• Two non-trivial issues:
1. Need to manage groups of nodes
• Allocate chunks on entire groups
• Manage nodes joining and leaving groups
2. Machine failures are more complex
• Need to re-replicate entire group, rather than individual nodes
RAMCloud Implementation
[Figure: RAMCloud Coordinator, RAMCloud Master and RAMCloud Backup exchanging two RPCs: an Assign Replication Group request, and an Open New Chunk request whose reply carries the replication group]
Coordinator Server List
Server ID    Replication Group ID
Server 0     5
Server 1     0
Server 2     5
Server 3     7
…            …
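A toy model of the coordinator's bookkeeping may help make the server list concrete. This is a sketch under assumptions, not RAMCloud code: the class and method names are hypothetical, RPCs are reduced to comments, and the failure policy (dissolve the whole group and return the survivors to the unassigned pool once their data has been re-replicated) is one possible reading of "re-replicate entire group".

```python
from typing import Dict, List, Optional

class Coordinator:
    """Toy server list: server ID -> replication group ID (None = unassigned)."""

    def __init__(self, replication_factor: int = 3) -> None:
        self.r = replication_factor
        self.group_of: Dict[int, Optional[int]] = {}
        self.next_group_id = 0

    def _form_groups(self) -> None:
        """Create a new group whenever r unassigned backups are available."""
        free = [s for s, g in self.group_of.items() if g is None]
        while len(free) >= self.r:
            members, free = free[:self.r], free[self.r:]
            for s in members:
                # In a real system this would be the Assign Replication Group RPC
                self.group_of[s] = self.next_group_id
            self.next_group_id += 1

    def enlist_backup(self, server_id: int) -> None:
        """A backup joins the cluster and waits until a group can be formed."""
        self.group_of[server_id] = None
        self._form_groups()

    def group_members(self, server_id: int) -> List[int]:
        """What a backup could report when a master opens a new chunk on it."""
        gid = self.group_of[server_id]
        if gid is None:
            return []
        return [s for s, g in self.group_of.items() if g == gid]

    def backup_failed(self, server_id: int) -> List[int]:
        """Dissolve the failed backup's group; return survivors to re-replicate from."""
        gid = self.group_of.pop(server_id)
        survivors = [s for s, g in self.group_of.items()
                     if gid is not None and g == gid]
        for s in survivors:
            self.group_of[s] = None
        self._form_groups()
        return survivors

coord = Coordinator(replication_factor=3)
for server_id in range(7):
    coord.enlist_backup(server_id)
survivors = coord.backup_failed(1)
print("re-replicate data held by:", survivors)
print("server list:", coord.group_of)
```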
HDFS Implementation
• Even simpler than RAMCloud
• In HDFS, replication decisions are centralized on the NameNode; in RAMCloud they are distributed
– NameNode assigns DataNodes to replication groups
• Prototyped in 200 LoC
HDFS Issues
• Has the same issues as RAMCloud in managing groups of nodes
• Issue: Repair bandwidth
– Solution: Hybrid scheme
• Issue: Network bottlenecks and load balancing
– Solution: Kill replication group, re-replicate its data elsewhere
• Issue: Replication group’s capacity is limited by the node with the smallest capacity
– Solution: Build replication groups from nodes with similar capacities
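The capacity issue is easy to see with a small example. The sketch below is one simple way to form groups of similar capacity (sort by capacity, then slice); it is an illustration rather than the scheme used in the prototype, and the function names and node capacities are made up.

```python
from typing import Dict, List

def group_by_capacity(capacity: Dict[str, int], r: int = 3) -> List[List[str]]:
    """Sort nodes by capacity and slice, so each group holds similar-sized nodes."""
    ordered = sorted(capacity, key=capacity.get)
    usable = len(ordered) - len(ordered) % r
    return [ordered[i:i + r] for i in range(0, usable, r)]

def group_capacity(group: List[str], capacity: Dict[str, int]) -> int:
    """Every member stores the same chunks, so the smallest node is the limit."""
    return min(capacity[n] for n in group)

# Hypothetical node capacities in TB
capacity = {"a": 1, "b": 4, "c": 1, "d": 4, "e": 2, "f": 4}

naive = [["a", "b", "c"], ["d", "e", "f"]]          # grouped in arrival order
matched = group_by_capacity(capacity, r=3)

naive_total = sum(group_capacity(g, capacity) for g in naive)
matched_total = sum(group_capacity(g, capacity) for g in matched)
print("naive grouping stores  ", naive_total, "TB of unique data")
print("matched grouping stores", matched_total, "TB of unique data")
```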
Facebook’s HDFS Replication
• Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss
• Facebook’s Algorithm:
– Primary replica is placed on node j and rack k
– Secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2)
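The placement rule above can be sketched as follows. This is an interpretation of the slide, not Facebook's code: it assumes nodes are indexed within each rack, that indices wrap around at the end of a rack and at the last rack, and that the two secondaries are drawn without further constraints from the 10 candidates (5 node offsets on each of 2 racks). All names are hypothetical.

```python
import random
from typing import List, Tuple

Node = Tuple[int, int]   # (rack index, node index within rack)

def candidate_nodes(j: int, k: int, nodes_per_rack: int, n_racks: int,
                    window: int = 5, rack_span: int = 2) -> List[Node]:
    """Nodes j+1..j+window on racks k+1..k+rack_span, wrapping around (assumed)."""
    return [((k + dk) % n_racks, (j + dj) % nodes_per_rack)
            for dk in range(1, rack_span + 1)
            for dj in range(1, window + 1)]

def place_replicas(j: int, k: int, nodes_per_rack: int, n_racks: int,
                   rng: random.Random,
                   n_secondaries: int = 2) -> Tuple[Node, List[Node]]:
    """Primary on (rack k, node j); secondaries drawn at random from the window."""
    candidates = candidate_nodes(j, k, nodes_per_rack, n_racks)
    return (k, j), rng.sample(candidates, n_secondaries)

rng = random.Random(1)
candidates = candidate_nodes(j=3, k=0, nodes_per_rack=20, n_racks=4)
primary, secondaries = place_replicas(j=3, k=0, nodes_per_rack=20, n_racks=4, rng=rng)
print("primary:", primary)
print("secondaries:", secondaries)
```

Because the secondaries can only fall inside a 10-node window, far fewer copysets are possible than with fully random placement, though still more than with one fixed group per node.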
Facebook’s Replication
Hybrid MinCopysets
• Split nodes into replication groups of 2 and 15
• First and second replica are always placed on the group of 2
• Third replica is randomly placed on the group of 15
Thank You!
Stanford University