Robustness in the Salus scalable block store
![Page 1: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/1.jpg)
Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam,
Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin
![Page 2: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/2.jpg)
Scalable and robust storage
More hardware
More complex software
More failures
![Page 3: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/3.jpg)
Achieving both is hard
Scalable systems (GFS/Bigtable, HDFS/HBase, WAS, Spanner, FDS, …)
Strong protections (End-to-end checks, BFT, Depot, …)
Read from 1 node
BFT: read from f+1 nodes
Challenge: Parallelism vs Consistency
![Page 4: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/4.jpg)
Salus
• Scalability:
  – Thousands of servers
• Robustness:
  – Tolerate disk/memory corruptions, CPU errors, …
  – Do NOT hurt performance/scalability.
• Usage:
  – Provide remote disks to users (Amazon EBS)
![Page 5: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/5.jpg)
Outline
• Challenges
• Salus’ overview
• Solutions
  – Pipelined commit
  – Active storage
  – Scalable end-to-end checks
• Evaluation
![Page 6: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/6.jpg)
Challenge: Parallelism vs Consistency
Metadata server
Storage servers
Clients
Infrequent metadata transfer
Parallel data transfer
Data is replicated for durability and availability
State-of-the-art scalable architecture (GFS/Bigtable, HDFS/HBase, WAS, …)
![Page 7: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/7.jpg)
Challenges
• Write in parallel and in order
• Eliminate single points of failure
  – Write: prevent a single node from corrupting data
  – Read: read safely from one node
• Do not increase replication cost
![Page 8: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/8.jpg)
Write in parallel and in order
Metadata server
Data servers
Clients
Write 1 Write 2
Write 2 is committed but write 1 is not. This is not allowed for a block store.
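The ordering requirement on this slide can be read as a prefix property: the set of committed writes must never skip an earlier write. A minimal sketch of such a check follows; the function name and the list/set representation are illustrative assumptions, not part of Salus.

```python
def is_prefix_consistent(issued, committed):
    """Return True if the committed writes form a prefix of the issue order.

    issued:    write ids in the order the client issued them
    committed: set of write ids that are committed
    """
    seen_gap = False
    for write_id in issued:
        if write_id in committed:
            if seen_gap:
                # A later write is committed although an earlier one is not.
                return False
        else:
            seen_gap = True
    return True
```

With `issued = [1, 2]`, the state `{1}` passes, while `{2}` (write 2 committed but write 1 not) is rejected.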
![Page 9: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/9.jpg)
Prevent a single node from corrupting data
Metadata server
Data servers
Clients
Single point of failure
Computation nodes:
• Data forwarding, garbage collection, etc.
• Tablet server (Bigtable), Region server (HBase), etc.
![Page 10: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/10.jpg)
Read safely from one node
Metadata server
Data servers
Clients
Single point of failure
![Page 11: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/11.jpg)
Do not increase replication cost
• Industrial systems:
  – Write to f+1 nodes and read from one node
• BFT systems:
  – Write to 2f+1 nodes and read from f+1 nodes
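As a small worked example of the two bullets above, the following sketch tabulates how many nodes each approach touches for a given f. The function name and dictionary keys are illustrative only.

```python
def replication_cost(f):
    """Nodes contacted per write and per read when tolerating f faults."""
    return {
        "industrial": {"write": f + 1, "read": 1},
        "bft": {"write": 2 * f + 1, "read": f + 1},
    }
```

For f = 1 this gives 2 nodes per write and 1 per read for the industrial approach, against 3 per write and 2 per read for BFT.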
![Page 12: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/12.jpg)
Outline
• Challenges
• Salus’ overview
• Solutions
  – Pipelined commit
  – Active storage
  – Scalable end-to-end checks
• Evaluation
![Page 13: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/13.jpg)
Salus’ approach
Start from a scalable architecture (Bigtable/HBase)
Ensure robustness techniques do not hurt scalability
![Page 14: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/14.jpg)
Salus’ interface and model
• Disk-like interface:
  – A fixed number …
  – Single writer
  – Barrier semantics
• Failure model:
  – Byzantine but not malicious
![Page 15: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/15.jpg)
Salus’ key ideas
• Pipelined commit
  – Guarantee ordering despite parallel writes
• Active storage
  – Prevent a computation node from corrupting data
• End-to-end verification
  – Read safely from one node
![Page 16: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/16.jpg)
Salus’ key ideas
Metadata server
Clients
Pipelined commit
Active storage
End-to-end verification
![Page 17: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/17.jpg)
Outline
• Challenges
• Salus’ overview
• Solutions
  – Pipelined commit
  – Active storage
  – Scalable end-to-end checks
• Evaluation
![Page 18: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/18.jpg)
Pipelined commit
• Goal: barrier semantics
  – A request can be marked as a barrier.
  – All previous requests must be executed before it.
• Naïve solution:
  – The client blocks at a barrier: lose parallelism
• A weaker version of distributed transaction
  – Well-known solution: two-phase commit (2PC)
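The barrier rule stated above can be checked mechanically. The sketch below assumes every issued request appears exactly once in the execution order and enforces only the rule on this slide (everything issued before a barrier executes before it). The function and parameter names are hypothetical.

```python
def respects_barriers(requests, execution_order):
    """Check that each barrier executes after all requests issued before it.

    requests:        list of (request_id, is_barrier) in issue order
    execution_order: list of request ids in the order they were executed
    """
    position = {rid: i for i, rid in enumerate(execution_order)}
    latest_earlier = -1  # latest execution position among requests issued so far
    for rid, is_barrier in requests:
        pos = position[rid]
        if is_barrier and pos < latest_earlier:
            # Some earlier request executed after this barrier.
            return False
        latest_earlier = max(latest_earlier, pos)
    return True
```

Non-barrier requests may be reordered freely, which is what leaves room for parallelism.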
![Page 19: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/19.jpg)
Pipelined commit – 2PC
[Figure: Client, Servers, and one leader per batch. Writes 1, 2, 3 form batch i and writes 4, 5 form batch i+1. Labels: Previous leader, Prepared, Committed, Leader.]
![Page 20: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/20.jpg)
Pipelined commit – 2PC
[Figure: Client, Servers, and one leader per batch, with writes 1, 2, 3 in batch i and writes 4, 5 in batch i+1. Labels: Previous leader, Batch i-1 committed, Leader, Commit, Batch i committed.]
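The idea on the two slides above, as far as the labels show it, is that batches can be prepared in parallel while batch i commits only after batch i-1 has committed. The following single-process sketch illustrates that ordering rule. It collapses the per-batch leaders into one object, and the class and method names are assumptions made for illustration, not the Salus implementation.

```python
class PipelinedCommit:
    """Toy model: prepare in any order, commit strictly in batch order."""

    def __init__(self):
        self.prepared = set()     # batch indices whose writes are all prepared
        self.committed_upto = -1  # highest batch index committed so far
        self.commit_log = []      # order in which batches were committed

    def on_prepared(self, batch):
        """Record that a batch is prepared, then commit every batch that is
        prepared and whose predecessor is already committed."""
        self.prepared.add(batch)
        while self.committed_upto + 1 in self.prepared:
            self.committed_upto += 1
            self.commit_log.append(self.committed_upto)
```

If batch 1 is prepared before batch 0, nothing commits until batch 0 is prepared too; then both commit in order.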
![Page 21: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/21.jpg)
Pipelined commit - challenge
• Is 2PC slow?
  – Additional network messages? Disk is the bottleneck.
  – Additional disk write? Let’s eliminate that.
  – Challenge: whether to commit a write after recovery
[Figure: writes 1 and 3, and write 2.]
2 is prepared. Should it be committed? Both cases are possible.
• Salus’ solution: ask other nodes
![Page 22: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/22.jpg)
Active Storage
• Goal: a single node cannot corrupt data
• Well-known solution: BFT replication
  – Problem: 2f+1 replication cost
• Salus’ solution: use f+1 replicas
  – Require unanimous consent of the whole quorum
  – How about availability if one replica fails?
  – If one replica fails, replace the whole quorum
![Page 23: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/23.jpg)
Active Storage
Computation node
Storage nodes
![Page 24: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/24.jpg)
Active Storage
Computation nodes
Storage nodes
• Unanimous consent:
  – All updates must be agreed on by f+1 computation nodes.
• Additional benefit:
  – Collocate computation and storage: save network bandwidth
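A minimal sketch of the unanimous-consent rule: an update is accepted only if exactly f+1 computation nodes proposed it and all proposals are identical. Comparing SHA-256 digests, and the names `accept_update` and `proposals`, are assumptions for illustration; the slides do not say how agreement is checked.

```python
import hashlib


def digest(update: bytes) -> str:
    """Digest used to compare proposals (SHA-256 is an assumed choice)."""
    return hashlib.sha256(update).hexdigest()


def accept_update(proposals, f):
    """Return the update if all f+1 computation nodes agree, else None.

    proposals: mapping from computation-node id to its proposed update bytes
    """
    if len(proposals) != f + 1:
        return None  # not every member of the quorum has spoken
    digests = {digest(update) for update in proposals.values()}
    if len(digests) != 1:
        return None  # at least one node disagrees: reject the update
    return next(iter(proposals.values()))
```

With f = 1, two matching proposals are accepted, while a single corrupted proposal causes the update to be rejected rather than stored.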
![Page 25: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/25.jpg)
Active Storage
Computation nodes
Storage nodes
• What if one computation node fails?
  – Problem: we may not know which one is faulty.
• Replace the whole quorum
![Page 26: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/26.jpg)
Active Storage
Computation nodes
Storage nodes
• What if one computation node fails?
  – Problem: we may not know which one is faulty.
• Replace the whole quorum
  – The new quorum must agree on the states.
![Page 27: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/27.jpg)
Active Storage
• Does it provide BFT with f+1 replication?
• No …
• During recovery, may accept stale states if:
  – The client fails;
  – At least one storage node provides stale states;
  – All other storage nodes are not available.
• 2f+1 replicas can eliminate this case:
  – Is it worth adding f replicas to eliminate that?
![Page 28: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/28.jpg)
End-to-end verification
• Goal: read safely from one node
  – The client should be able to verify the reply.
  – If corrupted, the client retries another node.
• Well-known solution: Merkle tree
  – Problem: scalability
• Salus’ solution:
  – Single writer
  – Distribute the tree among servers
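For readers unfamiliar with the Merkle tree mentioned above, here is a generic sketch of how a client can verify one block against a trusted root hash. It is a textbook construction, not Salus' distributed variant; it assumes a non-empty block list, uses SHA-256, and duplicates the last hash on odd-sized levels. All function names are illustrative.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level):
    """Hash adjacent pairs to produce the parent level."""
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(blocks):
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels with the last hash
        level = _next_level(level)
    return level[0]


def merkle_proof(blocks, index):
    """Sibling hashes from leaf to root, each tagged with whether it is on the left."""
    level = [h(b) for b in blocks]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(block, proof, root):
    """Recompute the path to the root; a corrupted block yields a different root."""
    node = h(block)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

The reader only needs the root to check any single block, which is what makes reading from one node safe. Keeping one tree over the whole volume is what the slide flags as a scalability problem.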
![Page 29: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/29.jpg)
End-to-end verification
Server 1 Server 2 Server 3 Server 4
Client maintains the top tree.
Client does not need to store anything persistently. It can rebuild the top tree from the servers.
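A rough sketch of the split described on this slide: each server keeps hashes over its own blocks, and the client keeps only one root per server (the "top tree"). To stay short, each server's subtree is flattened to a single level and the top tree to a plain list. How rebuilt roots are validated during recovery is not covered on the slide and is left out. The `Server` and `Client` classes are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Server:
    """Holds blocks plus a (one-level, simplified) subtree over them."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def subtree_root(self):
        return h(b"".join(h(b) for b in self.blocks))

    def read(self, i):
        # Reply with the block and the leaf hashes needed to recompute the root.
        return self.blocks[i], [h(b) for b in self.blocks]


class Client:
    """Keeps only the per-server roots, rebuilt from the servers at start-up."""

    def __init__(self, servers):
        self.servers = servers
        self.top = [s.subtree_root() for s in servers]

    def read(self, server_index, block_index):
        block, leaf_hashes = self.servers[server_index].read(block_index)
        ok = (leaf_hashes[block_index] == h(block)
              and h(b"".join(leaf_hashes)) == self.top[server_index])
        if not ok:
            raise ValueError("reply failed verification")
        return block
```

Because each server only maintains hashes for its own blocks, no single node has to hold or update a tree over the whole volume.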
![Page 30: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/30.jpg)
Recovery
• Pipelined commit
  – How to ensure write order after recovery?
• Active storage:
  – How to agree on the current states?
• End-to-end verification
  – How to rebuild Merkle tree?
![Page 31: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/31.jpg)
Discussion – why HBase?
• It’s a popular architecture
  – Bigtable: Google
  – HBase: Facebook, Yahoo, …
  – Windows Azure Storage: Microsoft
• It’s open source.
• Why two layers?
  – Necessary if storage layer is append-only
• Why append-only storage layer?
  – Better random write performance
  – Easy to scale
![Page 32: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/32.jpg)
Discussion – multiple writers?
![Page 33: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/33.jpg)
Lessons
• Strong checking makes debugging easier.
![Page 34: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/34.jpg)
Outline
• Challenges
• Salus’ overview
• Solutions
  – Pipelined commit
  – Active storage
  – Scalable end-to-end checks
• Evaluation
![Page 35: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/35.jpg)
Evaluation
![Page 36: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/36.jpg)
Evaluation
![Page 37: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/37.jpg)
Evaluation
![Page 38: Robustness in the Salus scalable block store](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681598a550346895dc6cd70/html5/thumbnails/38.jpg)
Read safely from one node
• Read is executed on one node:
  – Maximize parallelism
  – Minimize latency
• If that node experiences corruptions, …
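The end-to-end verification slide states that the client retries another node when a reply is corrupted. A minimal sketch of that read path follows, assuming the client already holds a trusted SHA-256 digest for the block and that each replica is modelled as a zero-argument callable returning bytes. Both are illustrative assumptions.

```python
import hashlib


def read_verified(replicas, expected_digest):
    """Read from one replica at a time and return the first verifiable reply.

    replicas:        callables that each return the block bytes from one node
    expected_digest: hex SHA-256 digest the client trusts for this block
    """
    for fetch in replicas:
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_digest:
            return data  # common case: the first node's reply verifies
    raise IOError("no replica returned a verifiable block")
```

In the common case only one node is contacted, so parallelism and latency are unaffected; extra reads happen only when verification fails.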