BFTCloud: A Byzantine Fault Tolerance Framework for Voluntary-Resource
Cloud Computing
Yilei Zhang, Zibin Zheng, and Michael R. Lyu
{ylzhang,zbzheng,lyu}@cse.cuhk.edu.hk
Department of Computer Science & Engineering, The Chinese University of Hong Kong, Hong Kong, China
School of Computer Science, National University of Defence Technology, Changsha, China
CLOUD 2011, Washington DC, USA, July 4 - 9, 2011
Cloud Computing
• Cloud computing provides a model for enabling convenient, on-demand network access to a shared pool of computing resources: networks, servers, databases, and services
• Voluntary-resource infrastructure consists of numerous user-contributed computing resources
Cloud Applications
• Cloud applications are built on a number of distributed cloud components and are large-scale, complicated, time-sensitive, and high-quality
• Case 1: The New York Times used EC2 and S3 to convert 15 million scanned news articles to PDF (4 TB of data), using 100 Linux computers for 24 hours
• Case 2: Nasdaq uses S3 to deliver historic stock and fund information: millions of files showing price changes of entities over 10-minute segments
Reliability of Cloud Applications
• The reliability of cloud applications is greatly influenced by the reliability of cloud modules
• Traditional testing offers limited improvement to the reliability of a cloud module under voluntary-resource cloud infrastructure:
– Computing resources, denoted as nodes in the cloud, are fragile
– Communication links between modules are not reliable
• Our goal: design a fault tolerance mechanism that handles different faults under voluntary-resource cloud infrastructure, which is urgently needed
BFTCloud
• BFTCloud uses replication techniques to overcome failures, since a broad pool of nodes is available in the cloud
• BFTCloud guarantees robustness of systems when up to 𝑓 of a total of 3𝑓 + 1 resource providers are faulty
• BFTCloud can tolerate different types of failures:
– Crash
– Network faults, such as disconnection
– Byzantine faults, such as malicious behavior
– Etc.
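As a quick check of the 3𝑓 + 1 bound, the following Python sketch computes how many faulty resource providers a group of a given size can tolerate, and the smallest group needed for a desired 𝑓. The function names are illustrative and are not part of BFTCloud.

```python
def max_tolerated_faults(group_size: int) -> int:
    """Largest f such that 3f + 1 <= group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return (group_size - 1) // 3


def min_group_size(f: int) -> int:
    """Smallest number of resource providers that tolerates f faulty ones."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1
```

For example, a group of 4 providers tolerates 1 faulty provider, and growing the group to 6 does not help until it reaches 7.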
Replica Selection
• QoS Score:
• Failure Probability of a BFT group:
• Replica selection problem:
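One plausible reading, offered as an assumption rather than the authors' definition, is that a BFT group of 3𝑓 + 1 replicas fails when more than 𝑓 replicas are faulty at once. Under the further assumption that replicas fail independently with known per-node probabilities, that chance can be computed as in the Python sketch below. The function name and the independence assumption are both hypothetical.

```python
def group_failure_probability(node_failure_probs, f):
    """P(more than f nodes are faulty), ASSUMING independent node failures.

    This is an illustrative model, not the formula from the BFTCloud paper.
    """
    # dist[k] = probability that exactly k of the nodes processed so far are faulty
    dist = [1.0]
    for p in node_failure_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each failure probability must lie in [0, 1]")
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1.0 - p)   # this node is healthy
            nxt[k + 1] += prob * p       # this node is faulty
        dist = nxt
    return sum(dist[f + 1:])
```

With four replicas that each fail with probability 0.5 and 𝑓 = 1, the group fails when two or more replicas are faulty, which gives 11/16.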
Request Execution
1. The cloud module first forms a request sequence and sends the sequence of requests to the primary.
2. The primary will order the requests and forward the ordered requests to all the BFT group members.
3. Each member of the BFT group will execute the sequence of requests and send the corresponding responses back to the cloud module.
4. The cloud module collects all the responses received from the BFT group members and judges whether the responses are consistent.
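The four steps can be sketched in Python with a toy, deterministic replica that keeps a running total. The `Replica` class, `primary_order`, and `run_round` are hypothetical names for illustration; real replicas would exchange messages over the network and may misbehave.

```python
class Replica:
    """Toy deterministic replica: keeps a running total of the integers it is sent."""

    def __init__(self, name):
        self.name = name
        self.total = 0

    def execute(self, ordered_requests):
        # Step 3: execute the requests in sequence order, one response per request.
        responses = []
        for seq, value in ordered_requests:
            self.total += value
            responses.append((seq, self.total))
        return responses


def primary_order(requests):
    """Step 2: the primary assigns consecutive sequence numbers to the requests."""
    return list(enumerate(requests, start=1))


def run_round(requests, replicas):
    """Steps 1 to 4 from the cloud module's point of view."""
    ordered = primary_order(requests)                       # steps 1 and 2
    return {r.name: r.execute(ordered) for r in replicas}  # steps 3 and 4 (collection)


replicas = [Replica(f"r{i}") for i in range(4)]
collected = run_round([5, 7], replicas)
```

Because every replica here is correct and sees the same order, all four collected response lists are identical.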
Consistency Judgment
• Case 1: The cloud module receives 3𝑓 + 1 consistent responses from the BFT group. No fault has occurred in the current BFT group.
• Case 2: The cloud module receives between 2𝑓 + 1 and 3𝑓 consistent responses. Fewer than 𝑓 + 1 faults have occurred.
• Case 3: The cloud module receives fewer than 2𝑓 + 1 response messages. Either the primary is faulty or more than 𝑓 + 1 replicas are faulty.
• Case 4: The cloud module receives more than 2𝑓 + 1 responses, but fewer than 𝑓 + 1 responses are consistent. This indicates inconsistent ordering of requests by the primary.
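The four cases can be expressed as a small classifier. This Python sketch assumes that "consistent responses" means the size of the largest group of identical responses, which is an interpretation and not stated on the slide. Combinations that the four cases do not cover are reported as "unclassified" rather than guessed.

```python
from collections import Counter


def judge(responses, f):
    """Classify one round of responses against the four cases.

    responses: one hashable response value per replica that answered.
    'Consistent' is ASSUMED to mean the largest group of identical responses.
    """
    received = len(responses)
    consistent = max(Counter(responses).values()) if responses else 0

    if consistent == 3 * f + 1:
        return "case 1"        # no fault in the current BFT group
    if 2 * f + 1 <= consistent <= 3 * f:
        return "case 2"        # fewer than f + 1 faults
    if received < 2 * f + 1:
        return "case 3"        # too few response messages
    if received > 2 * f + 1 and consistent < f + 1:
        return "case 4"        # inconsistent ordering by the primary
    return "unclassified"      # not covered by the four cases
```

With 𝑓 = 1, four identical responses fall under case 1, three identical responses plus one deviating response under case 2, only two responses under case 3, and four mutually different responses under case 4.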
Primary Updating
• A replica that suspects the primary to be faulty sends a primary election proposal to all the other replicas.
• If a replica receives 𝑓 + 1 primary election proposals, the primary is considered faulty: with at most 𝑓 faulty replicas, at least one of the proposals comes from a non-faulty replica.
• That replica then sends a primary selection request to the cloud module.
• The cloud module then starts the primary selection phase and returns a new primary, which is one of the current replicas.
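A replica's side of this exchange can be sketched as a counter over incoming proposals. In the Python sketch below, `ElectionTracker` is a hypothetical name, and counting each sender only once is an added assumption so that a single replica cannot reach the threshold by repeating its proposal.

```python
class ElectionTracker:
    """Tracks primary election proposals received by one replica (illustrative only)."""

    def __init__(self, f):
        self.f = f
        self.proposers = set()   # distinct replicas that proposed an election
        self.requested = False   # whether a selection request was already sent

    def on_election_proposal(self, sender_id):
        """Return True exactly once, when f + 1 distinct proposals have arrived.

        A True result means the replica should now send a primary selection
        request to the cloud module.
        """
        self.proposers.add(sender_id)
        if len(self.proposers) >= self.f + 1 and not self.requested:
            self.requested = True
            return True
        return False
```

With 𝑓 = 1, the first proposal does not trigger a request, a repeated proposal from the same sender does not either, and a proposal from a second sender does.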
Replica Updating
• The failure probability of a BFT group, under the condition that a set of replicas is already faulty, is:
• The new BFT group 𝜎′, which can tolerate up to 𝑓′ node failures, should satisfy 𝑃_𝜎′ > 𝑃_0. Therefore, the replica updating problem is reduced to a replication degree decision problem:
Experimental Setup
• We have implemented our BFTCloud approach in Java and deployed it as middleware in a voluntary-resource cloud environment
• The cloud infrastructure consists of 257 distributed PlanetLab computers located in 26 countries
• In our experiments, each node in the cloud is configured with a random malicious failure probability, which denotes the probability that arbitrary behavior occurs on the node
• Each node keeps the QoS information of all the other nodes and updates the information periodically.
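The per-node failure configuration can be mimicked in a few lines of Python. This is only a sketch of the idea: the uniform distribution over [0, 1) and both function names are assumptions, and the actual experiments ran on a Java middleware rather than this code.

```python
import random


def configure_nodes(node_ids, rng):
    """Assign each node a random malicious failure probability (ASSUMED uniform in [0, 1))."""
    return {node: rng.random() for node in node_ids}


def behaves_arbitrarily(failure_prob, rng):
    """Draw once: does the node behave arbitrarily on this request?"""
    return rng.random() < failure_prob


rng = random.Random(0)                       # seeded so that runs are repeatable
node_probs = configure_nodes(range(257), rng)
```

Passing an explicit `random.Random` instance keeps the simulated faults reproducible across runs.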
Performance Comparison
• NoFT: No fault tolerance strategy is employed for task execution in the voluntary-resource cloud.
• Zyzzyva: A state-of-the-art Byzantine fault tolerance approach. The cloud module sends requests to a fixed primary and a group of replicas. There is no mechanism for adapting to the dynamic voluntary-resource cloud environment.
• BFTCloud: The Byzantine fault tolerance framework proposed in this paper. The cloud module masks faults and adapts to the highly dynamic voluntary-resource environment.
• BFTRandom: The framework is the same as BFTCloud. However, this approach selects nodes at random in the primary selection, replica selection, primary updating, and replica updating phases.
Conclusion
• We identify the Byzantine fault tolerance problem in the voluntary-resource cloud and propose a Byzantine fault tolerance framework, named BFTCloud, for guaranteeing the robustness of cloud applications
• We have implemented the BFTCloud system and tested it on a voluntary-resource cloud
• We conduct large-scale real-world experiments to study the performance of BFTCloud on reliability improvement compared with other approaches