Availability in Globally Distributed Storage Systems

Presented By Ala`a Ibrahim

Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan

OUTLINE
• Introduction
• Disk Failures
• Correlated Failures
• Fault Tolerance Mechanisms
• Markov Model of Stripe Availability
• Markov Model Findings
• Conclusions

Data Center

Data Center Components

Server Components

Racks

Interconnects

Cluster of Racks

ALL THESE COMPONENTS CAN FAIL

Cell, Stripe and Chunk

[Figure: two cells, CELL 1 and CELL 2, labeled GFS Instance 1 and GFS Instance 2; each cell holds Stripe 1 and Stripe 2, and each stripe is made up of chunks.]

Failure Sources
• Failure Sources
  • Hardware – Disks, Memory etc.
  • Software – chunk server process
  • Network Interconnect
  • Power Distribution Unit
• Availability
• Reasons a node becomes unavailable
  • Overloaded
  • Crash or restart
  • Hardware error
  • Automated repair processes

Disk Failures
• Node restarts
• Planned machine reboots
• Unplanned machine reboots
• Unknown

Fault Tolerance Mechanisms
• Replication (R = n)
  • ‘n’ identical chunks (replication factor) are placed across storage nodes in different racks/cells/DCs
• Erasure Coding (RS (n, m))
  • ‘n’ distinct data blocks and ‘m’ code blocks
  • Can recover at most ‘m’ lost blocks from the remaining ‘n’ blocks
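
As a rough illustration of the trade-off between the two mechanisms, the sketch below compares raw storage overhead and the number of lost blocks each scheme tolerates. The function names are hypothetical, and R = 3 and RS(6, 3) are example parameters chosen for illustration, not values taken from these slides.

```python
def replication_profile(r):
    """R = r: r identical copies of each chunk (hypothetical helper)."""
    return {
        "stored_blocks_per_data_block": float(r),  # r copies stored per chunk
        "tolerated_losses": r - 1,                 # one surviving copy suffices
    }


def reed_solomon_profile(n, m):
    """RS(n, m): n data blocks plus m code blocks per stripe."""
    return {
        "stored_blocks_per_data_block": (n + m) / n,
        "tolerated_losses": m,  # any n of the n + m blocks rebuild the stripe
    }


if __name__ == "__main__":
    print("R = 3    ", replication_profile(3))
    print("RS(6, 3) ", reed_solomon_profile(6, 3))
```

Under these assumptions, RS(6, 3) tolerates one more lost block than R = 3 while storing half as many raw blocks per data block.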

Replication

1 Chunk

5 replicas

Fast Encoding / Decoding

Very Space Inefficient

Erasure Coding

[Figure: ‘n’ data blocks → Encode → ‘n + m’ blocks (‘n’ data blocks plus ‘m’ code blocks)]

Erasure Coding

Highly Space Efficient

Slow Encoding / Decoding

[Figure: ‘n’ data blocks → Encode → ‘n + m’ blocks (‘n’ data blocks plus ‘m’ code blocks) → Decode → ‘n’ data blocks]
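
The encode/decode cycle can be sketched with the simplest possible erasure code: a single XOR parity block, which corresponds to m = 1. This is not Reed-Solomon, which needs finite-field arithmetic to support m > 1; the function names and sample data are hypothetical and serve only to show the idea.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the n data blocks followed by one XOR parity (code) block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def decode(stored_blocks):
    """Rebuild the n data blocks when at most one stored block is None."""
    missing = [i for i, block in enumerate(stored_blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one lost block")
    blocks = list(stored_blocks)
    if missing:
        # XOR of all surviving blocks equals the one missing block.
        survivors = [b for b in blocks if b is not None]
        blocks[missing[0]] = xor_blocks(survivors)
    return blocks[:-1]  # drop the parity block


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    stored = encode(data)
    stored[1] = None  # simulate losing one data block
    assert decode(stored) == data
```

Decoding has to read every surviving block of the stripe to rebuild one lost block, which hints at why erasure coding is slower than simply reading another replica.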

Correlated Failures
• Failure Domain
  • Set of machines that fail simultaneously from a common source of failure
• Failure Burst
  • Sequence of node failures, each occurring within a time window ‘w’ of the next
  • Window w = 120 s
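
The failure-burst definition above can be sketched as a single pass over sorted failure timestamps. The function name and the sample timestamps are hypothetical, and treating a gap of exactly ‘w’ seconds as part of the same burst is an assumption.

```python
def group_into_bursts(failure_times, window=120.0):
    """Group failure timestamps (in seconds) into bursts.

    A failure joins the current burst when it occurs within `window`
    seconds of the previous failure; otherwise it starts a new burst.
    """
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    times = [0, 50, 160, 400, 1000, 1100]  # made-up sample data
    for burst in group_into_bursts(times):
        print(len(burst), burst)
```

Note that a burst can span more than ‘w’ seconds in total, because each failure is compared only with the one immediately before it.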

Correlated Failures…
• Failure Burst (Window Size)

Markov Model
• Chunk placement policy
• Cell Simulation
  • Trace-based simulation
  • Priority queue

Markov Chain
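
The slide shows only the chain diagram, so the following is a minimal sketch of how such a chain could be evaluated, under stated assumptions rather than the authors' exact model. The state is the number of available chunks in a stripe; each available chunk is assumed to fail independently at rate `fail_rate`, and one missing chunk at a time is assumed to be recovered at rate `recovery_rate`. The function name and all parameter values are hypothetical.

```python
def stripe_mttf(total_chunks, min_chunks, fail_rate, recovery_rate):
    """Mean time until fewer than min_chunks chunks are available.

    Starts from the fully healthy state. For a birth-death chain, the
    expected time D_s to first drop from state s to s - 1 satisfies
    D_s = (1 + r_s * D_{s+1}) / f_s, where f_s and r_s are the failure
    and recovery rates in state s.
    """
    time_to_drop = 0.0  # D_{s+1}; multiplied by r = 0 in the top state
    mttf = 0.0
    for s in range(total_chunks, min_chunks - 1, -1):
        f = s * fail_rate                                 # any of s chunks may fail
        r = recovery_rate if s < total_chunks else 0.0    # nothing to recover when full
        time_to_drop = (1.0 + r * time_to_drop) / f
        mttf += time_to_drop
    return mttf


if __name__ == "__main__":
    # Illustrative rates only (per day): R = 3 needs 1 chunk, RS(6, 3) needs 6 of 9.
    print("R = 3   :", stripe_mttf(3, 1, fail_rate=0.01, recovery_rate=10.0))
    print("RS(6, 3):", stripe_mttf(9, 6, fail_rate=0.01, recovery_rate=10.0))
```

Raising `recovery_rate` or adding redundancy both lengthen the mean time to unavailability in this sketch, which is the kind of comparison such a model is meant to support.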

Conclusion

• The findings provide feedback for improving
  • Replication and encoding schemes
  • Recovery rate
