BFTCloud: A Byzantine Fault Tolerance Framework for Voluntary-Resource Cloud Computing Yilei Zhang, Zibin Zheng, and Michael R. Lyu {ylzhang,zbzheng,lyu}@cse.cuhk.edu.hk Department of Computer Science & Engineering The Chinese University of Hong Kong Hong Kong, China School of Computer Science National University of Defence Technology Changsha, China CLOUD 2011, Washington DC, USA, July 4 - 9, 2011



Outline

• Introduction
• System Architecture
• System Design
• Experiments
• Conclusion


Cloud Computing

• Cloud computing provides a model for enabling convenient, on-demand network access to a shared pool of computing resources:
– Networks
– Servers
– Databases
– Services
• A voluntary-resource infrastructure consists of numerous user-contributed computing resources


Cloud Applications

• Building on a number of distributed cloud components:
– Large-scale
– Complicated
– Time-sensitive
– High-quality
• Case 1: New York Times
– Used EC2 and S3 to convert 15 million scanned news articles to PDF (4 TB of data)
– 100 Linux computers, 24 hours
• Case 2: Nasdaq
– Uses S3 to deliver historic stock and fund information
– Millions of files showing price changes of entities over 10-minute segments


Reliability of Cloud Applications

• The reliability of cloud applications is greatly influenced by the reliability of cloud modules

• Traditional testing yields limited improvement in the reliability of a cloud module under voluntary-resource cloud infrastructure:
– Computing resources, denoted as nodes in the cloud, are fragile
– Communication links between modules are not reliable

• Our goal: design a fault tolerance mechanism, which is urgently needed, for handling different faults under voluntary-resource cloud infrastructure


BFTCloud

• BFTCloud uses replication techniques to overcome failures, since a broad pool of nodes is available in the cloud
• BFTCloud guarantees the robustness of systems when up to 𝑓 of a total of 3𝑓 + 1 resource providers are faulty
• BFTCloud can tolerate different types of failures:
– Crashes
– Network faults, such as disconnection
– Byzantine faults, such as malicious behavior
– Etc.
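
The 3𝑓 + 1 bound can be made concrete with a little arithmetic. The sketch below is only an illustration; the function names `required_group_size` and `max_tolerated_faults` are hypothetical and not part of BFTCloud.

```python
def required_group_size(f: int) -> int:
    """Number of resource providers needed to tolerate f faulty ones (3f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_tolerated_faults(group_size: int) -> int:
    """Largest f such that 3f + 1 <= group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return (group_size - 1) // 3


if __name__ == "__main__":
    # Print the group size needed for a few values of f.
    for f in range(4):
        print(f"f={f}: group size {required_group_size(f)}")
```

Note that group sizes between two consecutive values of 3𝑓 + 1 do not raise the number of tolerated faults: six providers still tolerate only one faulty provider.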


System Architecture


System Architecture


Work Procedures of BFTCloud


Primary Selection


Replica Selection

• QoS Score:
• Failure Probability of a BFT group:
• Replica selection problem:
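
As a rough illustration of what a group failure probability can look like, the sketch below makes two assumptions that are not stated on the slide: nodes fail independently, and a group that tolerates 𝑓 faults fails when more than 𝑓 of its members are faulty. The function name `group_failure_probability` is hypothetical, and this is not necessarily the formula used by BFTCloud.

```python
def group_failure_probability(node_failure_probs, f):
    """Probability that more than f nodes are faulty.

    Assumes independent node failures (an assumption of this sketch).
    """
    # dist[k] = probability that exactly k of the nodes seen so far are faulty
    dist = [1.0]
    for p in node_failure_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("failure probabilities must lie in [0, 1]")
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1.0 - p)   # this node is correct
            nxt[k + 1] += prob * p       # this node is faulty
        dist = nxt
    # The group fails when the number of faulty nodes exceeds f.
    return sum(dist[f + 1:])


if __name__ == "__main__":
    # Four replicas (f = 1), each with a 10% chance of being faulty.
    print(group_failure_probability([0.1, 0.1, 0.1, 0.1], f=1))
```

Under these assumptions, replica selection amounts to choosing members whose individual failure probabilities keep this value low.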


Request Execution

1. The cloud module first forms a request sequence and sends the sequence of requests to the primary.

2. The primary will order the requests and forward the ordered requests to all the BFT group members.

3. Each member of the BFT group will execute the sequence of requests and send the corresponding responses back to the cloud module.

4. The cloud module collects all the responses received from the BFT group members and makes a judgment on the consistency of the responses.
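
The four steps above can be sketched as a toy, single-process simulation. Everything here is an assumption made for illustration: the "service" is a running sum, a faulty replica simply returns wrong values, and the names `order_requests`, `execute`, and `run_round` are hypothetical.

```python
from collections import Counter


def order_requests(requests):
    """Primary (step 2): assign a sequence number to each request."""
    return list(enumerate(requests))


def execute(ordered_requests, faulty=False):
    """Replica (step 3): run the requests in order, one response per request.

    The toy service is a running sum; a faulty replica returns -1 instead.
    """
    total = 0
    responses = []
    for seq, value in ordered_requests:
        total += value
        responses.append((seq, -1 if faulty else total))
    return tuple(responses)


def run_round(requests, group_size, faulty_ids=()):
    """Cloud module (steps 1 and 4): send requests, then collect responses."""
    ordered = order_requests(requests)
    responses = [
        execute(ordered, faulty=(replica_id in faulty_ids))
        for replica_id in range(group_size)
    ]
    # Count identical responses to find the largest consistent set.
    most_common_response, votes = Counter(responses).most_common(1)[0]
    return most_common_response, votes, len(responses)


if __name__ == "__main__":
    response, votes, received = run_round([1, 2, 3], group_size=4, faulty_ids={2})
    print(response, votes, received)
```

A real deployment would exchange these messages over the network, so the cloud module may also receive fewer responses than there are group members.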


Consistency Judgment

• Case 1: The cloud module receives 3𝑓 + 1 consistent responses from the BFT group. No fault has occurred in the current BFT group.

• Case 2: The cloud module receives between 2𝑓 + 1 and 3𝑓 consistent responses. Fewer than 𝑓 + 1 faults have occurred.

• Case 3: The cloud module receives fewer than 2𝑓 + 1 response messages. Either the primary is faulty or more than 𝑓 + 1 replicas are faulty.

• Case 4: The cloud module receives more than 2𝑓 + 1 responses, but fewer than 𝑓 + 1 responses are consistent. This indicates inconsistent ordering of requests by the primary.
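
A literal reading of the four cases can be written as a small classifier. This is a sketch under assumptions: responses are hashable values, "consistent responses" means the size of the largest set of identical responses, and the function name `judge` and the `"unclassified"` label for combinations the slide does not cover are hypothetical.

```python
from collections import Counter


def judge(responses, f):
    """Classify the received responses into one of the four cases."""
    total = len(responses)
    # Size of the largest group of identical responses.
    consistent = max(Counter(responses).values()) if responses else 0

    if total < 2 * f + 1:
        return "case 3"          # too few responses received
    if consistent == 3 * f + 1:
        return "case 1"          # all group members agree
    if 2 * f + 1 <= consistent <= 3 * f:
        return "case 2"          # enough agreement despite some faults
    if total > 2 * f + 1 and consistent < f + 1:
        return "case 4"          # many responses, almost no agreement
    return "unclassified"        # not covered by the four cases as stated


if __name__ == "__main__":
    print(judge(["ok", "ok", "ok", "bad"], f=1))
```

With 𝑓 = 1, for example, four identical responses fall under Case 1, while three identical responses plus one deviating response fall under Case 2.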


Primary Updating

• A replica that suspects the primary to be faulty sends a primary election proposal to all the other replicas.

• If a replica receives 𝑓 + 1 primary election proposals, the primary is considered to be really faulty.

• That replica then sends a primary selection request to the cloud module.

• The cloud module then starts the primary selection phase and returns a new primary, which is one of the current replicas.
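
The proposal-counting step on the replica side might look like the sketch below. It assumes that proposals are counted per distinct sender and that a replica issues its primary selection request only once; neither detail is stated on the slide, and the class name `ReplicaElectionState` is hypothetical.

```python
class ReplicaElectionState:
    """Tracks primary election proposals received by one replica."""

    def __init__(self, f):
        self.f = f
        self.proposers = set()   # distinct replicas that sent a proposal
        self.requested = False   # whether a selection request was already sent

    def on_proposal(self, sender_id):
        """Record a proposal.

        Returns True exactly once, when f + 1 distinct senders are reached,
        signalling that a primary selection request should be sent to the
        cloud module.
        """
        self.proposers.add(sender_id)
        if not self.requested and len(self.proposers) >= self.f + 1:
            self.requested = True
            return True
        return False


if __name__ == "__main__":
    state = ReplicaElectionState(f=1)
    for sender in ["r1", "r1", "r2", "r3"]:
        print(sender, state.on_proposal(sender))
```

Requiring 𝑓 + 1 proposals means at least one of them comes from a non-faulty replica, so up to 𝑓 faulty replicas cannot force a primary change on their own.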


Replica Updating

• The failure probability of a BFT group under the condition that a set of replicas are already faulty is:

• The new BFT group 𝜎′, which can tolerate up to 𝑓′ node failures, should satisfy 𝑃_𝜎′ > 𝑃_0. Therefore, the replica updating problem is reduced to a replication degree decision problem:


Experimental Setup

• We have implemented our BFTCloud approach in Java and deployed it as middleware in a voluntary-resource cloud environment

• The cloud infrastructure consists of 257 distributed computers from PlanetLab, located in 26 countries

• In our experiments, each node in the cloud is configured with a random malicious failure probability, which denotes the probability that arbitrary behavior occurs at the node

• Each node keeps the QoS information of all the other nodes and updates the information periodically.
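
The per-node failure configuration could be simulated along the following lines. This is a sketch only: the uniform distribution on [0, 1), the fixed seed, and the names `configure_nodes` and `behaves_arbitrarily` are assumptions, since the slide does not say how the probabilities were drawn.

```python
import random


def configure_nodes(num_nodes, seed=0):
    """Assign each node a random malicious failure probability.

    Probabilities are drawn uniformly from [0, 1), an assumption of this sketch.
    """
    rng = random.Random(seed)
    return {node_id: rng.random() for node_id in range(num_nodes)}


def behaves_arbitrarily(failure_prob, rng):
    """Decide whether a node misbehaves while handling one request."""
    return rng.random() < failure_prob


if __name__ == "__main__":
    nodes = configure_nodes(257)
    rng = random.Random(1)
    misbehaving = [n for n, p in nodes.items() if behaves_arbitrarily(p, rng)]
    print(len(nodes), "nodes configured;", len(misbehaving), "misbehave this round")
```

Seeding the generator keeps a simulated run repeatable, which makes it easier to compare different selection strategies on the same set of nodes.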


Performance Comparison

• NoFT: No fault tolerance strategy is employed for task execution in the voluntary-resource cloud.

• Zyzzyva: A state-of-the-art Byzantine fault tolerance approach. The cloud module sends requests to a fixed primary and a group of replicas. No mechanism is designed for adapting to the dynamic voluntary-resource cloud environment.

• BFTCloud: The Byzantine fault tolerance framework proposed in this paper. The cloud module masks faults and adapts to the highly dynamic voluntary-resource environment.

• BFTRandom: The framework is the same as BFTCloud, except that nodes are selected at random in the primary selection, replica selection, primary updating, and replica updating phases.


Experimental Results


Experimental Results


Conclusion

• We identify the Byzantine fault tolerance problem in the voluntary-resource cloud and propose a Byzantine fault tolerance framework, named BFTCloud, for guaranteeing the robustness of cloud applications

• We have implemented the BFTCloud system and tested it on a voluntary-resource cloud

• We conduct large-scale real-world experiments to study the performance of BFTCloud on reliability improvement compared with other approaches


Thank you!

Email: [email protected]